CHAPTER 1

INTRODUCTION 


1.1.  Motivation  and  Research  Objectives 

Virtual memory (VM) systems have been in use for several decades and continue to provide
cost-effective memory management despite advances in memory technology. Today, computers
ranging from supercomputers to supermicrocomputers and workstations implement virtual
memory [37]. The large body of research on VM has produced several models of program
behavior and several memory management policies (MMP). Carr, in his Ph.D. thesis [13],
presents a survey of models of program behavior. Denning [20] cites two forms of program
behavior models: models of programs' memory demand and models of memory management
policies. Batson and Madison [30] define another model of program behavior, the phase
transition model, based on locality characteristics. However, the best model of program
behavior, Carr concludes [13], is the program itself. In a simulation environment, as is the
case in this study, a program is represented by its address reference string.

Memory management policies, whether cited in the literature or implemented in real systems,
have been classified into two classes: variable allocation (dynamic) policies and fixed
allocation (static) policies. Examples of dynamic policies are the Working Set policy (WS) [18],
its variation the Page Fault Frequency algorithm (PFF) [14], and globally implemented policies.
Examples of static policies are Least Recently Used (LRU) and First In First Out (FIFO).

Dynamic policies have been shown to outperform static ones [10], [16]. However, they have
their own problems. WS, for example, is too expensive to implement; furthermore, it is unable
to avoid a heavy faulting rate during interlocality transitions [23]. The Damped Working Set
(DWS) [36] was introduced to avoid interlocality transition faults. However, Graham [25]
showed that DWS outperforms WS by less than 10%. The Sampled Working Set (SWS) [34] is a
cheaper realization of WS, but has poorer performance [20]. Ferrari and Yih [23] combined SWS
and DWS and introduced the Variable Sampled Working Set (VSWS). VSWS performance is no
worse than that of WS [23].

The page fault frequency algorithm is cheaper to implement [15] but performs worse than
WS [25]; it also exhibits anomalous behavior [24]. WS itself exhibits some types of anomalies
when tested against numerical programs [4], [8]. Other types of WS anomalies are discovered
in a multiprogramming environment. (See Chapter 2.) Carr [13] compared WS with "global"
CLOCK. The WS policy was shown to be only slightly superior to CLOCK. Carr combined the
features of WS and CLOCK into a new algorithm, WSclock, whose performance is similar to
that of WS although it is cheaper to implement.




Based on a survey of the research published in the VM area, two observations can be made.

The first concerns the nature of the experiments. The vast majority of the experiments simulate
single programs, even though multiprogramming systems are the real VM environment. See, for
example, the experiments in [3], [4], [6], [8], [16], [17], [21], [25], [28], and [36]. One can
only guess that researchers have assumed that results obtained from simulating a single
programming environment would not differ significantly when applied to multiprogramming
systems. One objective of this thesis is to investigate the accuracy of this assumption.


The second observation concerns memory management policies. A common characteristic of all
existing policies, whether static like LRU or dynamic like WS, prefetching [39] or
non-prefetching, is that they try to estimate program behavior at run time. In other words,
these policies solve all memory management related problems at run time. Three such problems
are: 1) when to bring a page into memory, 2) which page to replace, and 3) how much memory to
allocate. In this thesis, this type of MMP is referred to as run time policies. An alternative
approach is to have some or all of the memory management related problems solved at compile
time. Memory management policies using this approach will be referred to as compiler directed
policies (CD). The main objective of this thesis is to construct and develop a compiler directed
policy.

Run time policies suffer from two major drawbacks. First, the design of these policies did
not take into account that program behavior varies from one program category to another. For
example, numerical programs behave differently from system programs [3], [27]. Also, data base
referencing behaves differently from other types of applications [40], [38]. The second
drawback results from the fact that run time policies do not consider the interaction of
programs in a multiprogramming system. Programs affect each other through swapping for local
policies and through paging for global policies. Also, in a multiprogramming system, the amount
of free memory is variable; it varies according to the load on the system and to the amount of
memory occupied by each process in the system.

In this thesis, a compiler directed (CD) policy is designed with three main features:

(1) It exploits source level information at compile time. This information is passed to the
operating system through memory directives and is used to define the memory requirements of a
program during execution.

(2) It is designed to respond to changes in a program's intrinsic memory requirements, and to
the requirements of other programs running in the system.

(3) It recognizes the difference in program behavior exhibited by different program
applications. It is designed specifically for numerical programs.

1.2.  Overview  of  This  Work 

This work is concerned with designing a compiler directed policy. The performance of CD is
evaluated in a multiprogramming environment and compared with WS, since all other run time
policies perform either worse than WS or nearly the same [13], [15], [20], [23].




Chapter 2 focuses on the performance of WS in a multiprogramming environment. The
working set policy is shown to exhibit anomaly types which may not be discovered in a
uniprogramming environment. A multiprogramming model is used in Chapter 2 to evaluate the
performance of WS. The model generates results beyond those needed for the investigation of
anomalies. These results are used later in Chapter 4.


Chapter 2 demonstrates how the results obtained from simulating a multiprogramming
environment may differ from those obtained from a single programming environment.


In Chapter 3, CD, a compiler directed policy, is presented. CD uses three types of directives.
These directives are used by the operating system (OS) to define a process's memory requirements.
We develop algorithms to be used by a preprocessor at compile time to generate memory
directives. We also present algorithms for processing a directive when it is executed by the CPU.


Chapter 3 also deals with implementation issues of CD. In particular, a swapping strategy is
developed for CD. The strategy is based on the amount of free memory available on the system
and the overcommitment of memory to one or more processes in the system.


Subroutine and procedure call handling can cause a significant problem for the generation of
compile time directives and the processing of run time directives, a problem commonly
encountered in compilation techniques. Chapter 3 presents a technique for solving such
problems. Issues related to the cost of CD, especially the cost associated with executing
memory directives, are also discussed in Chapter 3.


Performance evaluation and measurements are presented in Chapter 4. The performance of
CD is compared to the performance of WS. Empirical results are gathered from a trace driven
simulator of a multiprogramming system.


The conclusions drawn from this research are presented in Chapter 5 together with some
suggestions for future research in this area.


CHAPTER 2

WORKING  SET  PERFORMANCE  IN  MULTIPROGRAMMING  SYSTEMS 


2.1.  Introduction 

The Working Set policy (WS) [18] is a local, variable allocation memory management policy.
WS is described as follows. Let P be the set of all pages of a program. Also, let a reference
string consist of a sequence of T references, r(1), r(2), ..., r(t), ..., r(T), in which r(t) is
the segment that contains the virtual address generated by a given program at time t. Time is
measured in virtual time. At virtual time t, the program's working set W(t, τ) is the subset of
P which has been referenced in the previous τ virtual time units, where τ is the WS window
size. The size of the working set, w(t, τ), is given by the number of pages in the working set
at time t. The average working set size X(τ) is defined as

\[ X(\tau) = \frac{1}{T} \sum_{t=1}^{T} w(t, \tau) \tag{2-1} \]

where T is the length of the reference string. More definitions will be given as we proceed in
this chapter.

A mechanism equivalent to the one designed by Morris [32] is used in this study to compute
the working sets of a program. A reference register is associated with each page frame; it is
set to zero each time the page is referenced. At the same time, the reference register of every
other page is incremented by one. In [32] the register is incremented at regular time intervals,
rather than at every reference to a virtual address, so the value in the register is an
approximation to the amount of virtual time since the last reference. In our model, the value in
the register is the exact amount of virtual time since the last reference. Therefore, our model
computes the exact working sets of a program. When the value in the register equals τ, the page
can be removed from the working set. The working sets are computed by performing WS scans of
each task at each virtual time unit t; an approximation of the working sets is achieved by
performing WS scans at various virtual time intervals [13]. In a WS scan, each page p in P is
examined. If the value a in the reference register of p is equal to or larger than τ (a ≥ τ),
then p is not in the working set; otherwise, p ∈ W(t, τ). Performing a scan each time a page is
referenced is very expensive. However, it is the only way to capture the real dynamic behavior
of WS. The main concern in this work is with the performance of WS rather than with its
implementation cost.
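The register mechanism can be illustrated with a minimal Python sketch. It is not the simulator used in this study; the function name `exact_working_sets` and the dictionary of registers are assumptions made for illustration. Each reference resets the register of the referenced page, ages every other page by one, and then performs a WS scan that keeps only pages whose register value a satisfies a < τ.

```python
def exact_working_sets(refs, tau):
    """Yield the working set after each reference in `refs` (window size `tau`)."""
    registers = {}  # page -> virtual time since its last reference
    for page in refs:
        for p in registers:
            registers[p] += 1   # every page ages by one reference
        registers[page] = 0     # the referenced page is reset to zero
        # WS scan: a page with register value a >= tau is not in the working set
        yield {p for p, a in registers.items() if a < tau}


if __name__ == "__main__":
    trace = [1, 2, 1, 3, 2, 2]  # hypothetical page reference string
    for t, ws in enumerate(exact_working_sets(trace, tau=2), start=1):
        print(t, sorted(ws))
```

Scanning after every reference is what makes the result exact; scanning only every few references would give the cheaper approximation mentioned above.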

2.2.  WS  Load  Control 

The working set policy requires that each process be allocated enough memory to accommodate
its working set. In a multiprogramming system, however, the working set of a program may grow
beyond the available free page frames. In such a case, the working set cannot be allocated in
main memory. Denning [20] provides WS with the following load control policy to guarantee that
the working set of a program is allocated:

The load control maintains an uncommitted frame pool, which is a list of available page frames,
and a count K of the pool's (non-negative) size. The highest priority ready task may be activated
if that task's working set size w satisfies:

w < K - K_0

where K_0 is a constant specifying the desired minimum on the pool. The purpose of K_0 is to
prevent needless overhead of dealing with memory overflow shortly after a new task is activated.
When a page fault occurs, the page fault handler subtracts 1 from the count K. ... If K is already
0 the page fault handler will first cause the load control to preempt a page from the lowest
priority active task; this implies that the lowest priority active task may not have its working
set fully resident. A deactivate decision may be issued by the page fault handler if the lowest
priority task has its resident set reduced to naught.
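The quoted load control rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Task` and `LoadControl` classes, the convention that a larger number means higher priority, and the bookkeeping of resident pages as a simple count are all hypothetical and are not taken from [20].

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int        # assumption: larger number = higher priority
    resident: int = 0    # number of pages currently resident
    active: bool = True


class LoadControl:
    def __init__(self, pool_size, k0=0):
        self.K = pool_size    # count of uncommitted frames
        self.K0 = k0          # desired minimum pool size
        self.active = []      # currently active tasks

    def may_activate(self, w):
        """Activation test from the quoted rule: w < K - K0."""
        return w < self.K - self.K0

    def page_fault(self, task):
        """Give `task` one frame, preempting a page first if the pool is empty."""
        if self.K == 0:
            # preempt one page from the lowest priority active task
            victim = min((t for t in self.active if t.resident > 0),
                         key=lambda t: t.priority)
            victim.resident -= 1
            self.K += 1
            if victim.resident == 0:   # resident set reduced to naught
                victim.active = False
                self.active.remove(victim)
        self.K -= 1
        task.resident += 1
```

In this sketch the deactivate decision is taken as soon as the victim's resident set is empty; the quoted text only says that such a decision may be issued.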

Note that WS load control has two parameters, τ and K_0. If τ is small, the average working
set size of each process is small and the multiprogramming level is increased. A small τ,
however, increases the fault rate of each process and can lead to thrashing. Moreover, WS with
a small τ performs much worse than CD (as will be discussed in Chapter 4). A large τ reduces the
fault rate but causes the process's working set to grow and depresses the multiprogramming level.
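The trade-off between fault rate and working set size can be observed on a single trace with a short Python sketch. The function name and the sample trace are hypothetical; the sketch assumes a pure WS policy in which a reference faults exactly when the page is not in the working set formed by the previous τ references.

```python
def ws_faults_and_mean_size(refs, tau):
    """Return (page faults, average working set size) for window size `tau`."""
    last = {}      # page -> virtual time of its most recent reference
    faults = 0
    total = 0
    for t, page in enumerate(refs, start=1):
        # fault if the page was never referenced or has left the working set
        if page not in last or t - last[page] > tau:
            faults += 1
        last[page] = t
        # working set size at time t: pages referenced within the last tau units
        total += sum(1 for s in last.values() if t - s < tau)
    return faults, total / len(refs)


if __name__ == "__main__":
    trace = [1, 2, 1, 3, 2, 2]  # hypothetical page reference string
    for tau in (1, 2, 3):
        print(tau, ws_faults_and_mean_size(trace, tau))
```

On this toy trace, raising τ from 1 to 3 lowers the fault count while the average working set size grows, which is the behavior described above.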

Selecting a value for K_0 represents a trade-off between maximal use of main memory and
reducing the overhead that occurs when the system becomes overcommitted. If K_0 is very large,
the multiprogramming level is depressed. If K_0 is very small, a swap will be required each
time the working set expands beyond the free pool size. In this thesis, K_0 = 0. However, a free
page pool is dynamically created when the working set of a process grows beyond the free pool
size. In this case the resident set of a "low priority process" is turned into the free pool as a
result of swapping. Swapping the resident set of a process is a modification of Denning's
suggestion to preempt a page from the "lowest priority active task." Carr [13] argued that
preempting the pages of the lowest priority task, one by one, "would appear to be a mistake,
since the lowest priority active process will be forced to execute with a restricted resident set
and will fault often and generates a great deal of paging I/O without making much progress." In
the next section we discuss the swapping policy used in our model.

2.3.  Swapping  Strategies 

Swapping is the deactivation of a process that occurs when load control detects overcommitment
and directs a reduction in the multiprogramming level. A process that causes memory
overcommitment is called the swapping process. A process whose resident set is preempted is
called the swapped process. The mechanism which handles swapping is called the swapping
mechanism (SM). SM has the following functions: find a candidate process for swapping and
preempt its resident set. In [20], the process to be swapped out is the lowest priority process
in the system. While this could be the proper choice from the policy standpoint, it is not
necessarily the best from a performance point of view. Carr [13] suggested four policies to
select a candidate process to swap out of memory. A swapped out process can be the faulting
process, the last process activated, the smallest process, or the largest process. The usefulness
of any of these policies depends on the optimizing criterion under consideration and the
immediate goal to be achieved. In our model, it is assumed that all the processes in the system
are of equal priority. Therefore, Denning's suggestion of the lowest priority process is not
practical for our model. Also, we argue that the faulting process should not be swapped out,
since it has just invoked SM. When it is reactivated, it may have to invoke SM again, and thus
continue to be blocked. It is very likely that the last process activated has suffered a swapping
just before it has been deactivated; essentially, a discrimination may occur against one of the
processes in the system. The smallest process policy discriminates against small processes,
whereas the largest process policy discriminates against large processes.

We introduce a new swapping policy based on treating every process in the system with
equal priority. No process should be swapped out more than once in a row. A process that
has been swapped out cannot be swapped out again until all the processes in the system have
been swapped out in turn. In a system with N processes, a process swapped out at time t may
become a candidate for swapping only after N swapping operations. Note that N may change its
value if a new process is activated or a process completes execution and leaves the system. One
way of implementing this policy is to use a CLOCK-like mechanism. All the processes in the
system are assumed to be arranged about the circumference of a circle. The CLOCK pointer (or
"hand") points at the last process swapped out by SM, and is advanced "clockwise" when SM is
invoked to find the next candidate for swapping.
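A minimal Python sketch of such a CLOCK-like candidate selection is given below. The class name `SwapClock`, its methods, and the optional `exclude` argument (used here to skip the faulting process, in line with the argument made in this section) are assumptions for illustration rather than the implementation used in the model.

```python
class SwapClock:
    """Round-robin choice of the next process to swap out."""

    def __init__(self, processes):
        self.ring = list(processes)   # processes arranged around the circle
        self.hand = -1                # index of the last swapped out process

    def add(self, proc):
        self.ring.append(proc)        # a newly activated process joins the ring

    def remove(self, proc):
        i = self.ring.index(proc)     # a completed process leaves the ring
        del self.ring[i]
        if i <= self.hand:
            self.hand -= 1            # keep the hand on the same process

    def next_victim(self, exclude=None):
        """Advance the hand clockwise and return the next swap candidate."""
        n = len(self.ring)
        for _ in range(n):
            self.hand = (self.hand + 1) % n
            if self.ring[self.hand] != exclude:
                return self.ring[self.hand]
        return None                   # no eligible process


if __name__ == "__main__":
    sm = SwapClock(["A", "B", "C"])
    print([sm.next_victim() for _ in range(4)])
```

With three processes and no exclusions, the candidates repeat in the order A, B, C, A, so no process is chosen twice before the others have had their turn.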

The resident set of a swapped out process is preempted by setting the value in the reference
register of each page equal to the value of τ. The size of the preempted resident set is added
to the count of free page frames.

Another major issue of swapping is how to reclaim the working set of a swapped out process.
There are two methods for a swapped out process to reclaim its working set. Demand paging
loads a page only when that page is referenced, whether or not it was a member of the
process's previous working set. Prepaging loads a collection of pages (the prepage set) when the
process is activated. The prepage set, in this context, is the process's working set or its
resident set at the time the process was swapped out. The main advantage of prepaging is that
it reduces page fault interrupts. However, the working set of a process has to be carefully
arranged in auxiliary memory slots when the process is swapped out. Although prepaging has
intuitive appeal, many systems avoid using it simply because of its added complexity. From the
performance standpoint, prepaging the entire working set has the same I/O effect as demand
paging each page of the working set. Prepaging eliminates some page fault interrupts and
possibly, if the working set pages are stored sequentially on disk, reduces the latency and seek
time for all but the first page. Other disadvantages of prepaging are cited in [13]. Above all,
the working set of a process is not necessarily the same when it is deactivated and when it is
later reactivated. It is very likely that a process will prepage some pages which might not be
referenced in the future. Besides wasting memory, prepaging may result in extra paging and
wasted paging I/O capacity. At any rate, from a performance point of view, prepaging is treated
as a regular page fault with a smaller service time. Page fault service time includes the page
fault interrupt as well as latency and seek time. The model used in this study implements the
demand paging mechanism.

2.4.  Previous  Work 

Since the early 1970's many research studies have investigated the performance of WS. For a
bibliography and empirical results reported on WS's performance, see the papers written by
Denning [20], and Abusufah and Malkawi [3], [8]. Denning summarized the results of research
conducted on WS [20] and drew several important conclusions. In 1972 Chu and Opderbeck
observed that WS generates lower space time cost than the least space time generable by the
LRU policy [14]. A similar conclusion could be derived from the experiments performed by
Graham and Denning [26]. Denning concludes that "the evidence available suggests that global
CLOCK and global LRU do not perform as well as WS." (The word global is added since global
policies are discussed but the statement did not explicitly mention the word global.) It is
interesting to note, however, that in the same section of the paper Denning refers to the
evidence obtained from Graham's Ph.D. thesis: "Graham's data shows that LRU is normally
significantly worse than WS when applied to single programs" [25]. Also, Denning notes that
"there is, unfortunately, little published performance data on the CLOCK and global LRU."
Evidently, WS had not been compared with global LRU and global CLOCK at the time of the
conclusions made in [20].

One can easily argue with the above conclusions regarding the performance of WS. It is
only natural that a dynamic local policy, when properly "tuned," performs better than a static
(local) one. However, the performance of a globally implemented static policy may or may not be
worse than that of WS. Such performance data can be obtained only from measurements of a
multiprogrammed system. The only relevant measurements cited in [20] were those performed by
Simon [35]. However, Simon compared WS and VMIN [33] in a queuing network model. His
thesis did not address the problem of comparing global and local dynamic policies. On the other
hand, Carr [13] simulated global CLOCK and WS policies in a multiprogramming environment.
Carr concludes that "little difference between local policies (e.g., WS) and global policies
(e.g., CLOCK) has been observed in a representative system". Carr introduced a new policy,
WSclock, which performs as well as WS, even though "it is much simpler than any of the other
WS algorithms" [13].

Comparing WS to the page fault frequency policy, PFF [14], Denning concludes that:

WS and PFF, when properly "tuned" by a proper choice of their control parameters, perform
nearly the same and considerably better than LRU; WS has a slight tendency to produce lower
space time minima than PFF. However, PFF may display anomalies for certain programs. Moreover,
the performance of PFF is much more sensitive to the choice of control parameter than is the
performance of WS.

However, Abusufah et al. [4], [8] showed that WS exhibits certain types of anomalies for a
certain type of programs. Out of 30 numerical programs studied in [3], all but one displayed two
types of anomalies: parameter-real memory and fault rate-real memory anomalies [24]. Moreover,
WS displayed great sensitivity to the choice of the control parameter τ [8]. Denning concluded
from the empirical studies conducted by Graham [25] and Simon [35] that "the WS policy can be
run with a single global τ-value and deliver throughput typically no worse than 10 percent from
optimum." Alanko, Haikala, and Kutvonen [6] concluded from their empirical results that "it is
impossible to find a single global τ-value that achieves the results reported in [20]." The work
done by Abusufah, Lee, Malkawi, and Yew [3], [8] shows that at least 6 values of τ are needed to
run a set of 17 programs within 10 percent from optimum. It is worthwhile to mention that the
sensitivity of WS or PFF to the choice of control parameter can be displayed only in a
multiprogrammed system. All empirical results were generated from individual reference traces,
assuming a uniprogrammed system and ignoring any interaction between the programs. For this
reason we believe that the sensitivity of WS to the choice of τ has not been fully investigated,
although the contradiction in the results reported in the literature makes previous conclusions
about finding a single τ-value look optimistic.

Based on the comparison of WS and VMIN [33] by Simon [35], Denning concludes that "no one is
likely to find a policy that improves significantly over the performance of the tuned WS policy."
Such a conclusion is motivated by the fact that VMIN is an optimum unimplementable policy.
Carr [13] argued that Simon's work did not provide enough evidence to support such a conclusion.
Simon estimated that VMIN achieves lower space time cost than WS by less than 5 percent on the
average. It is interesting to note, however, that VMIN is the optimal policy for finding the
minimum page fault rate; VMIN does not find a minimal space time cost. Therefore, comparing
WS with VMIN cannot serve as "compelling evidence" for the optimality of WS. The optimal policy
is DMIN [10]. In [11], DMIN showed significant improvement over both WS and VMIN.

A common characteristic of almost all research studies on WS performance is that they
use individual virtual address traces. Even when WS is compared to a global policy (global LRU),
individual programs are used in the experiments: "Graham's data shows that global LRU is
normally significantly worse than WS when applied to single programs" [20]. Denning states that:

The  WS  policy  serves  as  a  dynamic  estimator  of  the  segments  (pages)  currently  needed  by  a  pro¬ 
gram.  The  WS  is  defined  in  a  program's  virtual  time,  independently  of  other  programs:  thus,  there 
is  no  danger  that  the  load  on  the  system  can  influence  the  measurement... 

While it is true that WS defined in a program's virtual time is not affected by the load on the
system, in a real system the resident set of pages of a program does indeed change according to
system load. Maintaining the resident set equal to the working set may incur overhead in terms
of additional page transfers not reflected in the program's intrinsic demand. Thus it is clear
that the load on the system does affect the measurement of the paging activities of a program.
The paging activities of WS in a multiprogramming environment were empirically measured.

The experimental model is described in the next section. Empirical results on the WS
behavior in multiprogramming systems are reported in the following sections.

2.5.  Multiprogramming  Model 

In this thesis a simple model is used to evaluate the performance of WS in a multiprogramming
system. The model is shown in Figure 2-1. The same model is used for evaluating CD
(described in the next chapter); specific features related to CD will be discussed in Chapter 4.
The Process Queue (PQ) is implemented as a First In First Out (FIFO) queue and is used to hold
the active processes. Each process is represented by its virtual address trace. An address trace
consists of references to array elements only. Initially, all array data elements are stored in
the virtual storage. All instructions, constants, and simple variables are assumed to be resident
in main memory. The reason behind this assumption is that references to arrays dominate the
referencing behavior of numerical programs [4], [30]. Moreover, the virtual size of the storage
containing instructions, constants, and variables is usually much smaller than that used for
array structures. Therefore, it would be reasonable to have the code of a program locked in main
memory during the execution of that program; in the case of a structured program, the code of a
subprogram should be locked during the execution of that subprogram.

Figure 2-1. Multiprogramming model

PQ serves as the input to the system, which consists of the CPU and the main memory. The
main memory is organized into a set of blocks of equal size (pages). Similarly, the virtual storage
is divided into pages of the same size. The maximum memory available on the system, θ, is used as
a system variable. A list of unoccupied page frames in main memory (the free pool) is maintained.
The sum of the working set sizes of all programs is given by θ minus the free pool size (ρ). The
main memory is initially empty. Pages of a program are paged into main memory on demand. The
working set of a program is allowed to grow into the free pool as long as the free pool size is
larger than zero. If the free pool becomes empty, swapping is invoked and the working set of a
process is removed from main memory. The pages occupied by the swapped out process are
returned to the free pool. The swapping mechanism is discussed in Section 2.3.

A  round  robin  scheduling  strategy  is  used  to  schedule  the  control  of  the  CPU  by  the  multiple 
processes.  A  process,  in  control,  relinquishes  the  CPU  in  one  of  three  cases:  time  out  interrupt,  page 
fault  occurrence,  or  program  completion. 

A time slice is used as a system variable in the model to control the time out interrupt. Upon
generating a time out interrupt, the process controlling the CPU is removed from the system and
entered at the tail of the PQ. However, the interrupted process's working set is not removed from
main memory.

When a page fault occurs, the process in control leaves the CPU and another process from the
PQ gains control. The page fault is serviced by the page fault service device. The faulting process
is delayed by a fault service delay element until the page fault service is completed, before it is
fed back into the PQ. Page fault service time L consists of the interrupt handling time, the time
spent in searching for the addressed page in the virtual storage, the transfer time of a page from
disk to main memory, and the time for allocating a page frame. In this thesis we use a value of
L=2000 time units; each time unit is one memory reference. The paging device is the only I/O device
used in the system; this consideration further simplifies the model. In other words, the programs
are assumed to be executing in a CPU bound phase. Such an assumption is valid for programs which
consume most of the input data at the beginning of execution and generate the output data at the
end of execution. The programs used in our experiments comply with such behavior.

A process leaves the system after all of its virtual address trace has been processed. Upon
completion of a process's execution, the necessary statistics are collected. These statistics
include process specific and overall system statistics. The system parameters are:

(1) The maximum available physical memory on the system, θ. Very small values of θ are used
for theoretical purposes. For example, θ=5 pages is clearly an impractical choice of main
memory size. However, it is used to capture the behavior of WS in a small memory environment,
characterized by heavy swapping activity. On the other hand, using a very large θ may
lead to a case similar to a uniprogramming environment, where the working set of a program
can grow indefinitely and no swapping takes place at all. A wide range of θ values is used in
order to evaluate the dependence of WS behavior on the available memory space. A large
value of θ is interpreted in the context that the resident set of any program can grow to its
maximum limit, assuming that the program is running alone in the system.

(2) The WS parameter (the window size τ). It is difficult to find an optimal τ for any program
without empirical investigation. Therefore, we vary τ from τ=1 to τ=R, where R is the
reference string length of the largest program trace in the system. The window size is incremented
by 5 from τ=1 to τ=1000; then τ is incremented by 100 from τ=1000 to τ=10000; beyond this
value, an increment of 1000 is used. Such choices of τ are used to capture the behavior of WS
in great accuracy. For small values of τ, the WS characteristics change rapidly depending on
the intrinsic program behavior. In numerical programs the changes in locality structures are
abrupt. The life time curves obtained from numerical programs exhibit a step-like function
behavior [8]. Therefore, a very small increment in the value of τ may result in a drastic
change in the characteristics of program behavior under WS. See, for example, the life time
curves reported in [7], [8].

For each program in the system, we find an optimal τ depending on the optimizing criterion.
For example, we find the values of τ for which the fault rate is minimum, the space time cost
is minimum, and the throughput is maximum. We also find global values of τ for which the
system page faults and space time cost are minimum, and the throughput is maximum.

(3) The number of processes running simultaneously in the system. This number reflects the
maximum multiprogramming level, MPL. The values of MPL used in this thesis are 3, 4, 5,
and 10. However, only 5 programs are traced; the characteristics of these programs are found
in Table 2-1. MPL=10 is obtained by running two copies of each program at the same
time.

(4) The context switch (CS). CS is used to control the time out interrupt. In our model, we use a
large value of CS to reduce the dependence of the results on the time out interrupts. CS=1000
is much larger than the maximum possible life time between successive page faults for any of
the programs; averaged over all the programs in the system, the maximum life time is 350
time units. However, a smaller value, CS=100, is used to demonstrate the effect of CS on the
paging behavior.
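
The window-size grid described under parameter (2) can be generated as follows. This is a minimal
sketch, not the code used in the experiments. The function name tau_schedule is hypothetical, and
two details are assumptions: that the coarser steps restart exactly at τ=1000 and τ=10000, and that
the grid stops at the last value not exceeding R.

```python
def tau_schedule(R):
    """Window sizes from 1 up to R: step 5 below 1000, step 100 below 10000, then step 1000."""
    taus = list(range(1, 1000, 5))              # 1, 6, 11, ..., 996
    taus += list(range(1000, 10000, 100))       # 1000, 1100, ..., 9900
    taus += list(range(10000, R + 1, 1000))     # 10000, 11000, ... (empty if R < 10000)
    # ASSUMPTION: values above the reference string length R are simply dropped.
    return [t for t in taus if t <= R]
```

Under these assumptions the fine step places values such as 501, 551 and 601 on the grid, which is
the form of many of the τ values reported in the tables of this chapter.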

Table 2-1: Program characteristics

Program    # Statements   # DO Stat.   # Arrays   # Array References   # Pages
MAIN       163            16           7          79,325               7.S
FIELD      76             9            24         10,823               60
INIT       53             14           35         10,745               1 'J
CONDUCT    IS  21  -.-  2*M
HWSCRT     135            :s           7          22.'21               '6

Process specific measures used in this chapter are:

(1) The average virtual resident set size w(θ,τ). For each value of τ and θ, w is found by
averaging the working set size of a program over its virtual execution time, R. The working
set of a program is computed during each memory reference to the virtual space of the
program. A page is considered in the working set if the value in its reference register (a) is
less than τ. The value in the reference register is incremented during each reference. All pages
with a ≥ τ belong to the free pool. In a uniprogramming system, the working set of a program
can change only when the program is executing. In a multiprogramming system, the working
set of a program is likely to be affected by other running programs. In systems using global
policies, a running program's fault may result in replacing a page from another program's
working set; thus, programs interact through paging. WS restricts paging activity to the
program's own working set and to the free memory pool. Therefore, it seems that the working
set of a program is purely intrinsic to the program behavior. We have discussed that swapping
activity may, as well, be a means of interaction where the resident set of a program is affected
by another program's paging activity. A swapped out process loses its entire working set in
one swapping operation, or it may lose its working set pages, one by one, in several successive
swapping operations if the model suggested by Denning [20] is to be used. In the previous
section we discussed two methods for reclaiming the resident set of a process that has been
swapped out of main memory. It was argued that demand paging is less complicated than
prepaging. Prepaging preserves the inclusion property of the WS; namely, that W(τ1) ⊆ W(τ2),
where τ1 < τ2. The inclusion property may be violated if demand paging is used. Our model
implements demand paging for the reasons discussed in the previous section. Also, with
demand paging we will be able to investigate the claim that "the WS serves as a dynamic
estimator of the segments (pages) currently needed by a program" [20].


(2) The page fault rate F(θ,τ). The fault rate of a process is updated every time a reference to a
nonresident page is made. The fault rate of a process depends on the intrinsic behavior of the
process and on the interaction of the multiprogramming mix through swapping. Whether the
demand paging or the prepaging method is used, the working set of a swapped out process has to
be faulted back into main memory. In prepaging, one operation initiates an I/O for the entire
working set, whereas in demand paging, a page is faulted only when a reference is made to
that page.

(3) The swapping rate S(θ,τ). S is the number of a process's pages that get swapped out of main
memory on the request of another process's growing working set. Swapping does not
necessarily involve I/O operations. The working set of a process needs to be written back to the
virtual storage only if the pages have been updated (dirty pages). However, in this thesis we
simplify the model by considering only clean pages. Therefore, the cost of swapping is
associated with the swap interrupt, the search for a swapped out process, and the time for setting
the values in the reference registers of the members of the working set of the swapped out
process.
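
For a single reference string in isolation (no swapping, so only the intrinsic behavior is
captured), measures (1) and (2) can be sketched as below. The function names working_set_run and
working_set are hypothetical, and the reference register is modeled, as an assumption, by the
distance between the current time and the time of the page's last reference. A page is in the
working set while that distance is less than τ, so a reference faults when the distance measured
at the moment of the reference exceeds τ.

```python
def working_set_run(refs, tau):
    """Return (fault_count, average_working_set_size) for one reference string."""
    if not refs:
        return 0, 0.0
    last_ref = {}       # page -> time of its most recent reference
    faults = 0
    size_sum = 0
    for t, page in enumerate(refs):
        # Fault if the page was never referenced or has aged out of the window.
        if page not in last_ref or t - last_ref[page] > tau:
            faults += 1
        last_ref[page] = t
        # Register value a = t - last_ref[q]; q is in the working set if a < tau.
        size_sum += sum(1 for q in last_ref if t - last_ref[q] < tau)
    return faults, size_sum / len(refs)


def working_set(refs, t, tau):
    """Set of pages whose register value at time t is below tau."""
    last_ref = {}
    for i, page in enumerate(refs[: t + 1]):
        last_ref[page] = i
    return {q for q, i in last_ref.items() if t - i < tau}
```

In this isolated setting the inclusion property holds by construction: working_set(refs, t, tau1)
is always a subset of working_set(refs, t, tau2) when tau1 < tau2. The anomalies studied in this
chapter arise only once swapping between processes is added.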

The overall system statistics include the system page fault rate, Fsys(θ,τ), and the system average
virtual memory, Vsys(θ,τ). Fsys and Vsys are given as the sum of the fault rates and the average
virtual memory, respectively, of the individual processes.

2.6. WS Anomalies in Multiprogramming Systems

In this section we report empirical results on WS anomalies in multiprogrammed systems.
Five types of anomalies are defined by Franklin, Graham and Gupta in [24]. Empirical results,
reported previously, on WS anomalies have been generated from simulation of individual reference
traces [4], [8] and from the analysis of individual reference strings [24]. These results show that
WS exhibits two types of anomalies: namely, the real memory-fault rate (M-F) and parameter-real
memory (τ-M) anomalies. An M-F anomaly exists if

M(τ1) < M(τ2) and F(τ1) < F(τ2)

for some values of the WS parameter, τ1 and τ2. A τ-M anomaly exists if, for some τ1 and τ2,

τ1 > τ2 and M(τ1) < M(τ2).

Both types of anomalies, M-F and τ-M, do not violate the conditions of the generalized inclusion
property proposed by Franklin et al. [24]. The other anomaly types are: parameter-fault rate (τ-F)
anomaly, parameter-virtual memory (τ-V) anomaly, and virtual memory-fault rate (V-F) anomaly.
The WS policy cannot exhibit any of these three types of anomalies when tested against individual
programs in a uniprogramming environment. However, we will show that this is not the case in a
multiprogramming system. We will also define two more anomaly types specific to
multiprogramming systems.

We do not report in this thesis the results on τ-M and M-F anomalies since they have been
empirically reported in the literature [4], [8], [24]. Besides, they have little influence on the
controllability of the policy [24]. The new anomaly types discussed in this section are the system
memory-fault rate anomaly (θ-F) and the system memory-virtual memory anomaly (θ-V). These and the
other anomalies are defined and discussed in detail in the following subsections.

2.6.1.  Parameter-fault  rate  anomalies 

A parameter-fault rate anomaly (τ-F) in a multiprogramming system exists, for some τ1, τ2
and θ, if

τ1 > τ2 and F(τ1,θ) > F(τ2,θ).
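
Given a measured fault rate curve for one value of θ, the regions where this condition holds can be
located mechanically. The sketch below is one way to do it, assuming the curve is available as a
list of (τ, F) samples. The function name find_tau_f_anomalies is hypothetical, and it reports only
increases between adjacent samples, with the size of each increase taken as the larger-τ fault rate
minus the smaller-τ fault rate.

```python
def find_tau_f_anomalies(curve):
    """curve: iterable of (tau, fault_rate) pairs measured at one fixed theta.

    Returns a list of (tau_low, tau_high, f_low, f_high, size) for every pair
    of adjacent samples where the fault rate rises as tau grows.
    """
    points = sorted(curve)                      # order samples by tau
    regions = []
    for (t_lo, f_lo), (t_hi, f_hi) in zip(points, points[1:]):
        if f_hi > f_lo:                         # larger window, more faults
            regions.append((t_lo, t_hi, f_lo, f_hi, f_hi - f_lo))
    return regions


# Made-up sample values, for illustration only (not measured data).
sample = [(1, 900), (6, 850), (11, 870), (16, 800)]
print(find_tau_f_anomalies(sample))
```

A well behaved curve yields an empty list; each reported tuple marks one increasing portion of the
curve.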

Parameter-fault rate anomalies exhibited by individual processes are shown in Figures 2-2a - 2-2e
for programs MAIN, FIELD, INIT, CONDUCT, and HWSCRT, respectively. Each figure contains
several plots for different values of θ. We have used four different values of θ: 50, 100, 150 and
200 pages. Smaller values of θ represent the case of high memory contention, especially for
higher degrees of MPL. In each of these figures we plot the page fault rate, F, versus τ. A well
behaved fault rate is a nonincreasing function of τ. An increasing portion of the curve indicates
that a τ-F anomaly exists in that region. Consider, for example, Figure 2-2e for program HWSCRT
for θ=200 pages (solid line). The fault rate increases from 123 to 188 when τ increases from 10,000
to 15,000. Another anomaly exists in the τ region (901.051). The anomalies reported in Figures
2-2a - 2-2e are summarized in Table 2-2. For each θ value and for each program we report the
number of τ-F anomalies (N) and the size of the largest anomaly, ΔF. The anomaly size is given by
ΔF = F(θ,τ2) - F(θ,τ1). From Figures 2-2a - 2-2e and Table 2-2 it is clear that the fault rate is

Figure 2-3: System parameter-fault rate anomalies (2-3b: MPL=10; θ = 50, 100, 150, 200)

Individual program anomalies can occur in a multiprogramming system since a program's
fault rate may decrease at the cost of an increase in some other program's fault rate. However, as
long as the total system fault rate decreases with increasing τ, individual anomalies are not of
practical importance. We would, however, like to point out that anomalies do exist even for the
system fault rate. Parameter-fault rate anomalies are reported in Figures 2-3a and 2-3b, where F is
the system fault rate, for MPL=5 and 10. Figure 2-3a is a plot of the system fault rate versus τ when
the multiprogramming mix contains 5 programs. Two θ values are used in this plot, θ=100 (solid
line) and θ=150 pages. For θ=150, anomalies exist for larger values of τ than those exhibited for
θ=100 pages. The system fault rate versus τ when 10 processes are present in the system (two
copies of each program) is shown in Figure 2-3b. Four plots are shown for four values of θ: θ=50,

Table 2-3a: Parameter-fault rate anomalies (SYSTEM, MPL=10)

θ    Parameter    Fault Rate

150  551  1100
29332  32175  32038  23885  23402  23367  22870
6241  7696  7555  8569  8528  8100  8388  8326  8176  8292  8178
32376  33299  32370  24058  23507  23405  23459
8905  8748  9280  9258  9223  8870  8514  8621  8606  8425  8378
7000  7100  8100  8293  193  7600  229  8293  64 1
600  1200  1600  1800  2000  2200  2400  3000  8100
700  1500  1700  1900  2100  2300  2700  5900
81  686  189  60  154  22  16  41  2039  23


100, 150, 200. For θ=50, the anomalies exist with small values of τ (τ<100). This represents the
case of high memory contention, as does θ ≤ 20 for MPL=3. The anomalies demonstrated by
Figures 2-3a and 2-3b are summarized in Tables 2-3a and 2-3b.

In Tables 2-3a and 2-3b we report all the anomalies exhibited at the system level for MPL=5
and MPL=10. Each anomaly region is described by two values of τ (τ1 and τ2) and the two
corresponding values of the fault rate (F1 and F2). The anomaly size, ΔF, is measured as the
difference between F2 and F1. For large values of θ (θ=150, 200) the anomalies occur with larger
values of τ. Table 2-3b shows that the anomaly region for θ=100 occurs with τ<551, whereas for
θ=150 it starts with τ>551.

The significance of the anomalies is emphasized by both the size and the number of anomalies.
Figures 2-3a and 2-3b show that the anomalies do not occur in the same τ region when different θ
values are used; this further complicates the control of the WS fault rate function. Furthermore,
such anomalous behavior provides suitable conditions for the existence of system memory-fault rate
anomalies, discussed in a later section.


Table 2-3b: Parameter-fault rate anomalies (SYSTEM, MPL=5)

2.6.2. Parameter-virtual memory anomalies

A parameter-virtual memory anomaly (τ-V) in a multiprogramming system exists, for some
τ1, τ2 and θ, if

τ1 > τ2 and V(τ1,θ) < V(τ2,θ).

The anomaly size is given by ΔV = V(τ1,θ) - V(τ2,θ). Figure 2-4 illustrates parameter-virtual
memory anomalies for programs FIELD, INIT, and HWSCRT for MPL=5 and θ=100. It is obvious
from the plots in Figure 2-4 that the average virtual memory is not a nondecreasing function of τ.
In Figure 2-5 V is plotted versus τ for program INIT and θ=50, 100, 150, and 200. The anomalies of
Figure 2-5 are summarized in Table 2-4. Figure 2-5 and Table 2-4 show that anomalies associated
with the larger values of θ tend to be shifted to the right of the anomalies associated with smaller
values of θ. The τ-V anomalies when θ=200 exist for τ > 10,000, whereas for θ=150 anomalies
occur in the region τ < 7000. Increasing the memory space available on the system may eliminate
the anomalies in one region of τ values and generate other anomalies in another region with larger
values of τ.

Figure 2-4: Parameter-virtual memory anomalies (MPL=5, θ=100; FIELD, INIT, HWSCRT)

Figure 2-5: Parameter-virtual memory anomalies

The overall system virtual size is obtained by summing up the virtual sizes of the individual
processes. Figures 2-6a and 2-6b demonstrate τ-V anomalies, where V is the system's average
memory, for θ=100 and 200, respectively. Each figure contains three plots for MPL=4, 5, and 10.

The average virtual memory of a process can be reduced only as a result of a swapping
process. It is very likely that a swapped out process, when reactivated, cannot allocate its working
set; therefore, it initiates the swapping mechanism. A chain of swapping operations will definitely
lead to a reduction in the average memory space allocated to all processes. Consequently, τ-V
anomalies exist at the individual process level as well as at the system level.

Upon reducing the average virtual memory allocated to a program or to the system, as a
result of a parameter-virtual memory anomaly, the fault rate is expected to increase, assuming that
the fault rate function of virtual memory is well behaved, i.e., F(τ1) < F(τ2) if V(τ1) > V(τ2).
This suggests that a parameter-virtual memory anomaly should be associated with a parameter-fault
rate anomaly. However, the results obtained from our experiments show that this is not
always the case. To illustrate this observation, Figure 2-7 presents a plot of the page fault rate and
the average virtual memory versus τ for program INIT (θ=150 and MPL=5). In this figure a major
τ-V anomaly occurs in the τ region [2000, 2500]. The average working set size drops from 46.4 to
40.6 pages as τ increases from 2000 to 2500. In the same region, the fault rate drops from 220 to
209. The reduction in the average working set size in this region did not generate extra page faults.
However, τ-V anomalies in the regions τ = [5000, 6000] and [6500, 7000] are accompanied by τ-F
anomalies in the same regions. The fault rate increases from 198 to 226 as the average working set
size is reduced from 43.5 to 34.6 pages, when τ is increased from τ=6500 to τ=7000. For θ=200, a
τ-V anomaly (see Table 2-4) is not associated with a τ-F anomaly. Therefore, a parameter-fault
rate anomaly does not always accompany a parameter-virtual memory anomaly.

Table 2-4: Parameter-virtual memory anomalies (INIT, MPL=5)

       Parameter          Average Virtual Memory
θ      τ1       τ2        V(θ,τ1)   V(θ,τ2)   ΔV
50     *        61        9.15      8.82      0.33
100    501      601       29.5      28.8      0.7
       701      801       30.3      29.7      0.6
150    801      851       34.0      33.4      0.6
       2000     2500      46.4      40.6      5.8
       5000     6000      43.5      34.6      8.9
       6500     7000      43.5      35.0      8.5
200    10,000   30,000    54.8      46.4      8.4

Similar observations are made when the average working sets of all of the processes (Vsys) are
used instead of one process. In Figure 2-8 we plot Vsys and Fsys versus τ for MPL=5 and θ=100. Six
τ-V anomalies are exhibited by the figure, four of which are not matched with τ-F anomalies. For
example, Vsys drops from 77.5 to 57.8 pages (ΔV=20) as τ increases from τ=250 to τ=300. In the
same region Fsys drops from 5627 to 5343 (ΔF=284).

Figure 2-8: Page faults and average virtual memory versus τ (System, MPL=5, θ=100)

The fact that a τ-F anomaly does not necessarily accompany a τ-V anomaly implies that WS
may overestimate the size of a running process's working set, since a reduction in the working set
size may not result in a subsequent increase in the fault rate; instead, the fault rate continues to
decrease. In other words, WS may accumulate in the working set of a process more pages than it
actually requires. This is especially true during interlocality transition periods. However, it is also
possible for WS to accumulate redundant pages during the execution of a phase, rather than in
transition between phases. Assume that a program contains a large locality structure (phase A) and
several smaller phases. A properly tuned WS should be able to cover the locality comprised by
phase A. The choice of a τ value large enough to cover phase A may result in covering several
smaller phases before or after executing phase A. As a result of choosing a large value for τ, some
pages from previous phases may continue to be members of the working set. Thus, the conclusion
that "the WS serves as a dynamic measure of a program's memory demand" [20] is not accurate.
The results reported in this section show that WS may overestimate the memory requirements of a
program.

2.6.3.  Virtual  memory-fault  rate  anomalies 

A virtual memory-fault rate anomaly (V-F) in a multiprogramming system exists, for some θ,
τ1 and τ2, if

V(θ,τ1) > V(θ,τ2) and F(θ,τ1) > F(θ,τ2).

The existence of virtual memory-fault rate anomalies is due to the existence of only one of either
the parameter-fault rate anomaly or the parameter-virtual memory anomaly in the same τ region.
The existence of both anomalies in the same τ region eliminates the possibility of exhibiting a
virtual memory-fault rate anomaly. This observation is illustrated in the following three cases.

(1) τ1 > τ2, V(τ1) > V(τ2) and F(τ1) > F(τ2): τ-F and V-F anomalies

(2) τ1 > τ2, V(τ1) < V(τ2) and F(τ1) < F(τ2): τ-V and V-F anomalies

(3) τ1 > τ2, V(τ1) < V(τ2) and F(τ1) > F(τ2): τ-F and τ-V anomalies

In the first case, there exist a virtual memory-fault rate anomaly and a parameter-fault rate
anomaly; however, there exists no parameter-virtual memory anomaly. In the second case, there
exist a virtual memory-fault rate anomaly and a parameter-virtual memory anomaly but not a
parameter-fault rate anomaly. In the third case, both parameter-fault rate and parameter-virtual
memory anomalies exist but the virtual memory-fault rate anomaly does not exist. All of these cases
do in fact exist, as was shown in the previous section in Figures 2-7 and 2-8.
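
The three cases can be checked mechanically for any pair of measurements. The sketch below, with
the hypothetical function name classify_pair, takes two (τ, V, F) samples measured at the same θ and
returns the anomaly types they exhibit, using the definitions given in this section. It assumes the
two τ values differ.

```python
def classify_pair(p1, p2):
    """p1, p2: (tau, V, F) samples at the same theta, with different tau values.

    Returns the set of anomaly types ("tau-F", "tau-V", "V-F") shown by the pair.
    """
    (t1, v1, f1), (t2, v2, f2) = sorted([p1, p2], reverse=True)   # now t1 > t2
    found = set()
    if f1 > f2:
        found.add("tau-F")      # larger window, more faults
    if v1 < v2:
        found.add("tau-V")      # larger window, less average memory
    # V-F: whichever sample holds more memory also has the higher fault rate.
    if (v1 > v2 and f1 > f2) or (v2 > v1 and f2 > f1):
        found.add("V-F")
    return found


# Values quoted in the text for program INIT (theta=150, MPL=5).
print(classify_pair((2000, 46.4, 220), (2500, 40.6, 209)))   # case (2)
print(classify_pair((6500, 43.5, 198), (7000, 34.6, 226)))   # case (3)
```

The first pair yields the τ-V and V-F types and the second yields the τ-F and τ-V types. No pair
can return all three, because τ-F together with τ-V means that the sample with less memory has the
higher fault rate.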

The virtual memory-fault rate anomalies are illustrated graphically in Figure 2-9, where we
plot the page fault rate as a function of the average virtual memory for program INIT for θ=30
and MPL=3. The anomalies in the figure are indicated by the increasing portions of the curve. V-F
anomalies exist at the system level as well. In the previous subsection we observed that τ-V
anomalies are not always accompanied by a τ-F anomaly, a condition necessary for the existence
of V-F anomalies.

The V-F anomalies are particularly significant since they distort the shape of a life time
curve, which is the inverse of the fault rate plotted versus the average virtual memory. Life time
curves are used to model program behavior. Besides, some optimal multiprogramming management
strategies make use of life time curves, e.g., the primary knee criterion [20]. Most importantly, V-F
anomalies prove that WS tends to accumulate more pages in the working set of a program than it
actually needs. Furthermore, the existence of V-F anomalies suggests that the working set of a
process need not be prepaged into main memory after it has been swapped out. In fact, swapping
allows a process to re-evaluate its working set, and demand paging, after a swapping operation,
allows a process to remove redundant pages which could have accumulated in its working set.

2.6.4.  System  memory-fault  rate  and  system  memory-virtual  memory  anomalies 

One would like to control the fault rate of individual processes or of the entire system by
controlling the amount of memory available on the system. Such control is viable if the fault rate
does not increase when θ increases. A system memory-fault rate anomaly (θ-F) exists if, for some
θ1, θ2 and τ,

θ1 > θ2 and F(τ,θ1) > F(τ,θ2)

where F is the fault rate of one process or of the entire system. This anomaly type can exist only
in multiprogramming systems where the amount of memory available on the system dynamically
changes. Increasing the maximum memory allowable on the system can be thought of as a means of
reducing the page fault rate of individual programs or of the whole system. Contrary to one's
expectation, the fault rate may increase with increasing the maximum memory available on the
system.

Our empirical results show that WS exhibits θ-F anomalies for both the system fault rate and the
individual processes' fault rates. For MPL=3, the fault rate achieved with θ=14 is larger than that
achieved with θ=12 for τ values 20-30. For MPL=4, the fault rate achieved with θ=150 can be
larger than that achieved with θ=100 by as much as 1512 faults, as shown in Table 2-5. Similar
observations are made for MPL=5 and 10. For example, for τ=151 and θ1=100, θ2=150,
F(τ,θ2) > F(τ,θ1).


System memory-virtual memory anomalies (θ-V) exist in the same way as do system
parameter-fault rate anomalies. A θ-V anomaly exists if, for some θ1, θ2 and τ,

θ1 > θ2 and V(τ,θ1) < V(τ,θ2)

i.e., the average virtual memory allocated to a process decreases instead of increasing when θ is
increased.

Table 2-5: System memory-fault rate anomalies

τ      θ=100    θ=150    ΔF
15     8768     10280    1512
255    5123     5125     2
265    5054     5059     5
385    4403     4408     5
395    4334     4336     2

The anomalies reported in this section are not exhaustive. There are many other anomalies, of
all discussed types, which are not reported here; however, the figures and tables presented in this
section are sufficient to illustrate the anomalous behavior of WS in multiprogramming systems.
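
The entries of Table 2-5 can be checked against the θ-F definition with a few lines of Python. This
is only a sketch: the name theta_f_anomalies is hypothetical, and reading the first column of the
table as the window size τ is an assumption. The fault counts are the ones listed in the table for
θ=100 and θ=150.

```python
# Fault counts from Table 2-5, keyed by tau (first column, ASSUMED to be tau):
# tau -> (faults at theta=100, faults at theta=150)
TABLE_2_5 = {
    15: (8768, 10280),
    255: (5123, 5125),
    265: (5054, 5059),
    385: (4403, 4408),
    395: (4334, 4336),
}


def theta_f_anomalies(rows):
    """Return (tau, size) for every tau where the larger memory gives more faults."""
    found = []
    for tau in sorted(rows):
        f_small_theta, f_large_theta = rows[tau]
        if f_large_theta > f_small_theta:       # more memory, more faults
            found.append((tau, f_large_theta - f_small_theta))
    return found


print(theta_f_anomalies(TABLE_2_5))
# [(15, 1512), (255, 2), (265, 5), (385, 5), (395, 2)]
```

Every row satisfies the θ-F condition, and the computed differences agree with the ΔF column of the
table.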


2.6.5.  Explaining  the  anomalies 

The WS policy is a local dynamic memory management policy and, therefore, the programs in
a multiprogramming system may affect each other's working sets through swapping, as discussed
earlier in Section 1 of this chapter. The swapping activity, thus, may be responsible for the
unpredicted paging behavior. Our empirical results show that the swapping activity in a
multiprogramming system is indeed the main reason for the existence of anomalies. To illustrate
this observation we record for each τ the swapping rate, S(θ,τ). The swapping rates associated with
parameter-fault rate anomalies of program INIT (MPL=3) are presented in Table 2-6a. This table
includes all the anomalies exhibited by program INIT in order to illustrate the effect of swapping
on the fault rate. Consider, for example, the table entry for θ=11, τ1=21, and τ2=66. The fault
rate increase, ΔF, is 262 page faults and S(θ,τ2) = 2635 > S(θ,τ1) = 1600. Note that the swapping
rate increase, S(θ,τ2) - S(θ,τ1) = 1635, is much larger than the fault rate increase, ΔF = 262. The
reason for this difference is that not all the pages, previously swapped out, will have to be paged

were paged in; they remained in the working set of a program until a swapping operation occurred.
Accumulation of unnecessary pages is possible because the window size, τ, can be large enough to
cover more than one of the program localities, as has been discussed earlier. As with V-F
anomalies, this observation suggests that the working set of a swapped out program need not be
brought entirely into memory once the program is rescheduled for execution. Such a strategy is
further supported by the fact that a swapping rate increase does not necessarily produce a fault
rate increase. The choice of this strategy in this study is, therefore, justified. Moreover, it
further weakens the claim that the WS serves as a measure of program demand.

The swapping activity is also responsible for parameter-virtual memory anomalies (see Table
2-6b). This is obvious, since a swapping operation removes the working set of a process from main
memory, thus reducing the working set size to zero. This by itself does not generate anomalies.
Anomalies by definition are related to the WS parameter $\tau$. Therefore, if the swapping rate
generated under a larger value of $\tau$ is higher than that generated under a smaller value of
$\tau$, then there is a chance for the anomalies to appear. A plot of the swapping rate of the
system versus $\tau$ is given in Figure 2-10 for MPL=5 ($\theta$ = 100, 150, 200). Figure 2-10
shows that the swapping rate is an increasing function of $\tau$ most of the time. Moreover,
swapping occurs at relatively large values of $\tau$. For $\theta = 200$, the swapping rate curve
is shifted to the right of those for $\theta = 150$ and $\theta = 100$; swapping occurs at larger
values of $\tau$.

Table 2-6b

Parameter-virtual memory anomalies (INIT)

$\theta$   $\tau_1$   $\tau_2$   $V(\theta, \tau_1)$   $V(\theta, \tau_2)$   $S(\theta, \tau_1)$   $S(\theta, \tau_2)$

A swapping rate increase that results in a parameter-virtual memory anomaly, but not in a
parameter-fault rate anomaly, produces a virtual memory-fault rate anomaly, as discussed earlier
in this section. Likewise, a swapping rate increase that results in a parameter-fault rate anomaly,
but not in a parameter-virtual memory anomaly, results in a virtual memory-fault rate anomaly.
Moreover, a system memory-fault rate anomaly has been shown to be preceded by a parameter-fault
rate anomaly. Therefore, it can be concluded that the swapping activity in multiprogramming
systems is the main reason for the anomalous behavior discussed in this chapter.


Figure 2-10: Swapping rate versus $\tau$ (system, MPL=5, $\theta$ = 100, 150, 200)


2.7.  Summary  and  Conclusions 


This chapter has demonstrated WS anomalies in multiprogramming systems. The presence of
anomalies is of theoretical interest in itself. However, we found that the anomalies are far too
numerous to be considered only of pathological or contrived nature. Practically, the existence of
anomalies complicates the control process of the WS policy. The WS parameter, $\tau$, may not be
used in a straightforward manner to control the fault rate in the system and memory allocation.
Moreover, WS anomalies, especially the parameter-virtual memory anomaly, illustrate how WS
overestimates a process's working set and, hence, how memory could be overcommitted during the
execution of a process.

Furthermore, this study suggests that results obtained from uniprogramming studies should
not be used in a simplistic manner to arrive at multiprogramming paging strategies. The WS policy
exhibits only certain types of anomalies in a uniprogramming system. In a multiprogramming
system, performance measures depend not only on the intrinsic behavior of a program but also
on the behavior of other processes in the system. Interaction between processes takes place through
paging in global policies such as global LRU and through swapping in local policies such as WS.

The WS anomalies, together with the high implementation cost of WS, leave open the search
for a better policy for managing memory hierarchies in multiprogramming systems. The next
chapter presents a new approach to the memory management problem. A parameterless policy is
proposed which can respond to the memory requirements of a program taking into consideration
the requirements of other processes in the system.


CHAPTER 3

CD:  A  COMPILER  DIRECTED  MEMORY  MANAGEMENT  POLICY 


The idea of using memory directives (MD) for the management of memory hierarchies in a
multiprogramming virtual memory system (VM) has been hinted at by many authors. Madison
and Batson [30] suggested that if program localities generated by the BLI model could be correlated
to the source level code, then it would be possible for the compiler to generate MD to identify
program localities at run time. Abu-Sufah [5] suggested the use of data dependence graphs to isolate
the localities at the source level. In his Ph.D. thesis Abu-Sufah found that the localities of
numerical programs in a paged system generated by the BLI model are due to loop structures. A similar
conclusion was made by Malkawi [31] for segmented systems. The use of memory directives for
optimal memory management was also suggested by Hagmann and Fabry [27] and by Kearns and
DeFazio [29]. Except for [5] and [31], none of the researchers have proposed any particular MD to be
used. Abu-Sufah proposed a directive called allocate which has the function of locking a page in
memory if it can be identified as a member of a program locality. When the program moves to
another locality phase, a deallocate routine is called to release those pages allocated during the
execution of the previous locality. Abu-Sufah suggested that a program has to be transformed [2]
before allocate and deallocate can be effectively used. Program transformation requires the use of
data dependence graphs to resolve data dependencies. The directives suggested by Abu-Sufah fail to
reflect the hierarchical structure of program localities, which is a common locality characteristic
[30]. Besides, allocate and deallocate cannot respond to the dynamic change in the memory status
of a multiprogramming system.

The idea of using MD has been practically implemented in real systems. Both VAX/VMS and
Berkeley UNIX allow the user to lock and unlock some pages in physical memory. The effectiveness
of such facilities in VAX/VMS was illustrated by Abaza [1], who showed that the performance of
some numerical algorithms can be enhanced if the directives provided by VAX/VMS are properly
used. However, one would like to free the user from having to call a system routine to lock or to
release a page, and from having to isolate a page that should be locked in memory in order to
achieve a better performance. Besides, a user may not be able to determine which page should be
locked and when it should be released, unless he has the proper knowledge of his program's behavior
as well as knowledge of the system.

In this thesis, three memory directives are designed to achieve two goals. The first one is to
allocate X physical page frames to a running process's resident set. A directive designed for this
purpose should be able to define the size of a program's resident set and allocate enough physical
pages to accommodate it. In this study, such a directive is called ALLOCATE. The second goal is to
lock a page or set of pages in main memory. A locked page, by definition, is exempted from being
paged out by the page replacement mechanism. A directive is developed for this purpose, and is
called LOCK in this thesis. LOCK has a similar function to the directive proposed by Abu-Sufah [5]
and to the system facilities provided by VAX/VMS and Berkeley UNIX (VMS and UNIX user manuals).
A page that has been locked in memory by LOCK is unlocked by a directive called UNLOCK. Later
in this chapter, we shall discuss a case in which the operating system (OS) is entitled to release a
page before UNLOCK does so. ALLOCATE, LOCK, and UNLOCK are discussed in greater detail in
the following sections.

Based on the three directives developed in this study, a compiler directed memory management
policy (CD) is proposed. CD operates as follows. At compile time, a preprocessor generates
directives of the type ALLOCATE, LOCK, and UNLOCK. These directives are inserted at appropriate
locations into the compiled object code of a user's program. At execution time, the directives are
executed by the CPU. When a directive is executed, the CPU generates a call to a particular OS
routine responsible for processing and handling memory directives. Figure 3-1 presents a block
diagram of the policy.

Figure 3-1: Block diagram of a compiler directed memory management policy

3.1. Memory Directive: ALLOCATE

One of the major problems a memory management policy has to solve is the amount of physical
memory that should be allocated to a program during its execution. Run time policies, whether
static or dynamic, determine the number of pages to be allocated at run time, as discussed in the
first chapter. It has been shown in Chapter 2 that WS, a dynamic run time policy, may overestimate
a program's memory requirements. Compiler directed memory management policies estimate
the memory requirements of a program at compile time, using source level information, and pass
them to the OS through ALLOCATE, which is designed in accordance with the locality
characteristics of program behavior and the constantly changing free memory space available on a
multiprogramming VM system.

3.1.1. Locality characteristics of numerical programs

A locality structure may result from data structures created at run time, e.g., stacks, or from
data structures declared in the source code of a program. The latter case is considered in this
thesis. The BLI model of program localities [30] suggests that array references inside loop
structures of numerical programs are the main reason for the existence of localities at execution
time [5], [31]. A nested loop structure produces a hierarchical locality structure. Such a structure
defines one of the locality characteristics, namely the level of a locality in the hierarchy of
localities. Another characteristic of major significance to our study is the virtual size of a
locality. The time duration is also a locality characteristic, as seen in [5], [30], [31]. Consider
Example 3-1 for illustration.

Example 3-1:

      DO 10 I = 1, N
         DO 20 J = 1, M
            E(I,J) = F(I,J)
   20    CONTINUE
         DO 30 K = 1, M
            G(K,I) = H(K,I)
            DO 40 L = 1, NN
               V(L) = V(L)*2
   40       CONTINUE
   30    CONTINUE
   10 CONTINUE

The localities of the above code are illustrated in the following diagram:

Level 1: Loop 10
   Level 2: Loop 30
      Level 3: Loop 40

Example 3-1 shows a FORTRAN-like piece of code. The maximum nest depth of the loop
structure is three. Two arrays, E and F, are referenced inside loop 20. Arrays E and F are
referenced in a row major order, i.e., the elements of a row are referenced while the current row
index, I, is fixed. The elements of an array are stored in a column major order; this assumption
holds throughout this thesis. Every element of arrays E and F is referenced one time during one
iteration of loop 10, i.e., the entire virtual space of arrays E and F is spanned during one iteration
of loop 10 due to a full execution of loop 20. Hagmann and Fabry called this type of referencing
pattern total [27]. Note that there are M iterations of loop 20 per each iteration of loop 10.
Therefore, a locality comprised by loop 10 includes the virtual spaces of E and F.

Loop 10 is the outermost loop, which forms the highest level locality, or level one locality as
termed in [30].

Arrays G and H are referenced in a column major order inside loop 30. When loop 30
executes, the column elements of arrays G and H are referenced sequentially, while the column index,
I, is fixed at the outer loop level (loop 10). Since the elements of one column are stored in
consecutive pages, according to the storage scheme, the locality at this level includes only the
virtual space of the column being referenced in the virtual space of arrays G and H. The index I
takes a new value only when loop 10 reiterates. The elements of a new column will be referenced
during the next iteration of loop 30. In other words, the virtual space spanned during the execution
of loop 30 is determined by the new column elements of G and H. Consequently, references to G and H
inside loop 30 form a locality as long as loop 30 remains active. However, every time loop 30
resumes execution a new set of pages forms the locality. In the diagram of Example 3-1, localities
formed by loop 30 are illustrated by $G_1 H_1, \ldots, G_n H_n$, where $G_i$ is the virtual size of
column $i$ of array G.

A one-dimensional array, vector V, is referenced inside loop 40. During the execution of loop
40, the virtual space of V is spanned totally. The virtual space of V is referenced totally during
each iteration of loop 30 and loop 10. Therefore, V participates in the localities formed at levels 1
and 2 as well as at level 3. The localities are illustrated graphically in the diagram of Example
3-1. Example 3-1 is too simple to illustrate how program localities can be automatically extracted
from the source level code.
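
The locality sizes described above can be sketched numerically. The following Python fragment is only an illustration under stated assumptions: the array shapes (E and F are N x M, G and H are M x N, V has NN elements), the page capacity `elements_per_page`, and the function names are hypothetical and do not come from the thesis; arrays and columns are assumed to start on page boundaries.

```python
from math import ceil


def pages(elements, elements_per_page):
    """Pages needed to hold `elements` contiguous array elements."""
    return ceil(elements / elements_per_page)


def locality_sizes(n, m, nn, elements_per_page):
    """Virtual size, in pages, of each locality level of Example 3-1.

    Assumed shapes (not given in the text): E and F are n x m,
    G and H are m x n, and V has nn elements.
    """
    whole_e_or_f = pages(n * m, elements_per_page)
    whole_g_or_h = pages(m * n, elements_per_page)
    one_column = pages(m, elements_per_page)  # one column of G or H
    vector_v = pages(nn, elements_per_page)
    return {
        1: 2 * whole_e_or_f + 2 * whole_g_or_h + vector_v,  # loop 10: E, F, G, H, V
        2: 2 * one_column + vector_v,  # loop 30: one column of G and of H, plus V
        3: vector_v,  # loop 40: V only
    }


sizes = locality_sizes(n=64, m=128, nn=256, elements_per_page=128)
print(sizes)
```

With these illustrative numbers the level sizes shrink from the outermost loop inward, which is the hierarchical property the rest of this section relies on.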

Our concern here is with the hierarchical characteristic of program localities. A macroscopic
view of the locality structure exhibited in Example 3-1 shows that all arrays, E, F, G, H and V,
should be considered part of the program's current locality. This view is obtained by looking at
loop 10 as an indivisible entity. If the program's memory reference pattern is observed while the
program is executing loop 40, the program's current locality appears to include vector V only. Such a
microscopic view of the locality structure shows that the smallest program locality dominates all
other localities. This illustrates how a program may change localities within a given locality
structure (intra-locality transitions). In Example 3-1, intra-locality transitions occur between
levels 1 and 2 and between levels 2 and 3.

The problem of intra-locality transitions was treated in [9] by linearizing the locality
structure. To linearize a locality structure consisting of two levels is to decide that one of the
locality levels is more significant than the other at some time instance. The least significant
localities are dropped from the locality structure, thus leaving only one path connecting all
locality levels. The difficulties of this approach are cited in [5] and [31]. Besides, each locality
level in a locality structure reflects the memory referencing behavior of a program during a
particular phase of the program execution. In Example 3-1, while the program executes loop 40, the
virtual space of vector V is being referenced continuously, irrespective of the significance of the
level 3 locality compared to level 2 or 1. Therefore, the locality comprised by loop 40 is
significant during the execution of loop 40, the locality comprised by loop 30 is significant during
the execution of loop 30, and so on. Allocating the outermost loop produces the minimum possible
fault rate for a given locality structure, irrespective of its relative significance to other levels,
since the virtual spaces of all referenced arrays within the outer loop are made resident in memory.
However, it may not always be possible to allocate the locality comprised by the outermost loop
(level one locality) due to insufficient free memory. In such cases, the availability of free memory
should determine which level of the locality structure should be allocated.

From the above discussion the following observations are made. The highest level locality
(level 1) produces the lowest possible fault rate when allocated completely in main memory. That
is because every page referenced inside a level one locality is paged only once into main memory
(assuming a demand paging strategy), and will not be replaced by the page replacement strategy.
The allocation of a level one locality implies that the resident set of a program should not be less
than the number of pages referenced inside the locality. However, if a level one locality is too
large to fit in the main memory, the next lower level locality should be considered for allocation
(lower level localities have a smaller size than higher level localities). In other words, a program
settles down to a microscopic view of its locality structure. If the second level locality cannot be
allocated, the third level locality is tried for allocation, and so forth. A program may keep
reconsidering its lower level localities for allocation as long as there exists at least one more
lower level locality. The program should not, however, be allowed to run if the lowest level
locality cannot be allocated; this restriction is necessary to prevent thrashing. Assume that the
lowest level locality contains N pages and there are only N-1 free memory pages. N-1 pages from the
lowest level locality may reside in main memory and one page has to be maintained in virtual memory.
Every time a reference is made to the Nth page (in virtual memory), a page has to be removed from
the main memory. A future reference to a replaced page will cause a page fault, which results in
replacing another page. The outcome of this cyclic faulting process is a short lifetime between
successive faults, a phenomenon known as thrashing.
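
The cyclic faulting argument can be checked with a small simulation. The sketch below is illustrative only: the thesis does not name the replacement algorithm here, so LRU replacement, a strictly cyclic sweep over the N pages of the locality, and the helper names `count_faults` and `cyclic_references` are all assumptions made for the example.

```python
from collections import OrderedDict


def count_faults(reference_string, frames):
    """Count page faults under demand paging with LRU replacement (assumed policy)."""
    resident = OrderedDict()  # pages in memory, least recently used first
    faults = 0
    for page in reference_string:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults


def cyclic_references(n_pages, sweeps):
    """Reference string that sweeps pages 0..n_pages-1 repeatedly, as a loop would."""
    return [page for _ in range(sweeps) for page in range(n_pages)]


refs = cyclic_references(n_pages=10, sweeps=5)
print("faults with N frames:  ", count_faults(refs, frames=10))
print("faults with N-1 frames:", count_faults(refs, frames=9))
```

Under these assumptions, N frames give one fault per page of the locality, while N-1 frames make every reference fault, which is the behavior the restriction on the lowest level locality is meant to rule out.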

These observations lead to two key principles underlying the design of the ALLOCATE
directive. First, the highest level locality, level one in the hierarchical locality structure, is
favored over localities of lower levels for allocation purposes. Secondly, the lowest level locality
in a hierarchical locality structure imposes a lower limit on the memory space that should be
allocated to a running process. These two principles reflect the dynamic change of the program's
memory demand due to intrinsic properties of program behavior. The failure to recognize these
principles may lead to improper memory allocation strategies. In order to incorporate the above
principles into MD, each locality at some level in the hierarchical locality structure is assigned a
priority index, P.

Up to this point one can recognize two primitives for ALLOCATE. The first one is the amount
of memory to be allocated, X, given by the virtual size of a locality. The second one is the priority
of allocation, P. ALLOCATE may have the following form

ALLOCATE (P, X)

where X is the virtual size of a locality, and P is the priority index associated with that locality.
Upon executing a directive of type ALLOCATE, a request is issued to the operating system to
allocate X pages, given that the priority of allocation is determined by P. Both primitives, P and X,
will be discussed in more detail in Section 3.1.4.

ALLOCATE, in its simple form given above, cannot respond to the dynamically changing
amount of free memory space in a multiprogramming system. The amount of free memory space
available on the system may increase if a running process completes its execution and returns to the
system whatever memory it has occupied, or if a process enters a new phase with a smaller size
locality, thus adding the released pages to the free memory. On the other hand, the free memory
may shrink in size if a new process is added to the system or if a running process enters a new
phase with a larger size locality. Moreover, the above form of ALLOCATE does not completely
incorporate the first principle cited above, namely, that higher level localities should be favored
over lower level ones. To account for these two drawbacks, a more complex form of the ALLOCATE
directive is given below:

ALLOCATE $(P_1, X_1)$ else $(P_2, X_2)$ else $\cdots$ else $(P_n, X_n)$, where $X_1 \geq X_2 \geq \cdots \geq X_n$

Each ALLOCATE directive has one or more parameters. Each parameter has two primitives
enclosed in parentheses, $(P_i, X_i)$. At any level of a hierarchical locality structure, ALLOCATE
contains a parameter associated with the current level and one parameter for each level enclosing the
current level. The order of parameters in ALLOCATE is such that parameters associated with
higher level localities precede those associated with lower ones, as shown in Figure 3-2. A
multi-nested loop structure is shown in Figure 3-2. Each loop forms a locality with size X and has a
priority P. The outermost loop forms a level one locality with $X_1$ and $P_1$. The directive
associated with this locality is ALLOCATE $(P_1, X_1)$. Going down in the hierarchy structure to the
second loop, with nest depth 2, the directive reconsiders the allocation of the previous locality
specified in $(P_1, X_1)$ before it considers the primitives of the second level locality specified
by $(P_2, X_2)$, and so forth.

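The ordering of parameters described above can be sketched for a single path through a loop nest. This Python fragment is a hypothetical illustration, not the thesis's preprocessor: the function names are invented, and it assumes the convention adopted in Section 3.1.2 that the innermost (lowest level) locality receives P = 1, so the outermost locality receives a priority equal to the nest depth.

```python
def build_directives(locality_sizes):
    """Build one ALLOCATE parameter list per nesting level.

    `locality_sizes` lists the virtual sizes X from the outermost loop
    (level one) to the innermost loop of a single nest path. The priority
    numbering (innermost gets P = 1) is an assumption taken from the
    convention stated in Section 3.1.2.
    """
    depth = len(locality_sizes)
    # The depth must be known before the first priority can be assigned.
    parameters = [(depth - level, size) for level, size in enumerate(locality_sizes)]
    # The directive at level k lists the enclosing levels first, then level k.
    return [parameters[: level + 1] for level in range(depth)]


def render(directive):
    """Format a parameter list in the notation used in the text."""
    return "ALLOCATE " + " else ".join(f"({p},{x})" for p, x in directive)


directives = build_directives([100, 50, 10])
for directive in directives:
    print(render(directive))
```

Because the outermost priority depends on the nest depth, the helper needs the whole nest before it can emit the first directive; this is one way to see why a one pass top down insertion scheme is not sufficient under this numbering.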

3.1.2. Processing of ALLOCATE directive by the operating system

For the moment, we assume that directives of the form

ALLOCATE $(P_1, X_1)$ else $(P_2, X_2)$ else $\cdots$

have been inserted into the program's code at compile time. At run time the directives are executed
by the CPU. Once a directive is executed, a system routine is invoked to handle its processing.
ALLOCATE issues requests of the form $(P_1, X_1), (P_2, X_2), \ldots$ in the same order. The OS first
receives the request $(P_1, X_1)$ and tries to allocate $X_1$ pages from the available free memory.
If $X_1$ pages cannot be allocated, then the OS examines the value of $P_1$. As a convention, P=1 is
chosen to be the priority of the lowest level locality in a hierarchical locality structure. Hence,
$P_1 > 1$ means that there is at least one more lower level locality, and at least one more directive
argument $(P_2, X_2)$, where $X_2 < X_1$ and $P_2 < P_1$. In this case, the program is allowed to
continue its execution, with its current memory allocation from the previous directive, until the
next request $(P_2, X_2)$ is received. Once again, if $X_2$ cannot be allocated, the execution
continues only if $P_2 > 1$. This process continues until a memory request $X_i$ is allocated, or
the priority of the request is $P_i = 1$. In the latter case, the program is in the scope of its
lowest level locality or, using source level code notation, the program is currently executing the
innermost loop of a multi-nested loop structure. The OS then either suspends the program's execution
or invokes a swapping mechanism (SM). The choice between these two actions depends on the priority
of the running job and the priorities of other jobs existing in the system at the time of processing
a directive. In the performance evaluation of CD it was assumed that all processes have the same
priority and, thus, the OS invokes SM whenever it has to make a choice. SM is discussed below in
greater detail. The processing of ALLOCATE is shown in Figure 3-3.
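
The decision procedure described in this section can be summarized in a few lines of code. The following Python function is a minimal sketch, not the routine shown in Figure 3-3: the function name, the returned action labels, and the treatment of all requests of one directive as a single list are assumptions made for illustration.

```python
def process_allocate(requests, free_pages):
    """Walk the (P, X) requests of one ALLOCATE directive in order.

    Returns a tuple (action, P, X) where action is
      "granted"  - the first request that fits in free memory is allocated,
      "swap"     - a request with P == 1 does not fit, so the OS must suspend
                   the program or invoke the swapping mechanism,
      "deferred" - no request fits but none has P == 1; the program keeps its
                   current allocation until a later request arrives.
    """
    priority, size = None, None
    for priority, size in requests:
        if size <= free_pages:
            return ("granted", priority, size)
        if priority == 1:
            return ("swap", priority, size)
        # priority > 1: a lower level locality exists, try the next request
    return ("deferred", priority, size)


print(process_allocate([(3, 100), (2, 50), (1, 10)], free_pages=85))
print(process_allocate([(1, 25)], free_pages=20))
print(process_allocate([(3, 100)], free_pages=20))
```

The only test the sketch applies when memory is short is whether the priority is larger than one, which mirrors the convention of assigning P=1 to the lowest level locality.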

In Figure 3-3 the priority index P=1 is used to indicate the lowest level locality. With P=1
associated with the lowest level locality, the OS simply checks whether the current priority is
larger than one in order to determine the next step when sufficient memory cannot be allocated.
Otherwise, if P=1 were associated with the highest level locality and P increased with the depth of
the locality structure, a look-ahead scheme would be necessary to know the relative position of the
current locality. However, assigning P=1 to the lowest level locality, comprised by the innermost
loops, inhibits the use of a one pass top down parsing scheme when the directives are inserted, as
will be seen in Section 3.1.4.

3.1.3. Swapping mechanism


The OS may invoke a swapping mechanism (SM) if the available memory space is not enough
to allocate the current request and the priority of the current request is P=1. Besides being able to
invoke SM, CD provides a strategy for partial swapping, using the priority primitive of ALLOCATE.

In regular swapping strategies, a process is selected for swapping, according to some criteria,
and its resident set is removed from main memory. We call this strategy total swapping, as opposed
to the partial swapping strategy (PS). Partial swapping reduces the current resident set of a
process, selected for swapping, to a smaller value. Partial swapping is made possible by the
priority primitive and the hierarchical nature of ALLOCATE. PS operates as follows. When
invoked by a directive with P=1, the swapper searches for any process occupying memory space $X_i$
with a priority $P_i > 1$. The resident set of such a process is reduced from $X_i$ to a new value
$X_j$, where $X_j$ is the size of a lower level locality with $X_j < X_i$ and $P_j < P_i$. In the
model used in our study, we reduce the resident set size of a process to the one associated with
P=1. The philosophy behind partial swapping is that a process A may find enough memory space to
allocate its largest locality, while process B cannot allocate its smallest locality. This may happen
if process A is scheduled to run when the system is not heavily loaded, while process B enters the
system when it is heavily loaded. Forcing all processes to run with their smallest localities allows
more processes to share memory. However, thrashing is prevented by ensuring that every process is
allocated enough memory to accommodate one of its localities, no matter how small the locality is.

Various schemes of partial swapping could be implemented. For example, a multiple queue
could be used to hold processes with different directive priorities. The partial swapping mechanism
would transfer the processes in the largest priority queue to the next lower level and continue
until either the memory request is satisfied or the only nonempty queue is the one with P=1. Total
swapping becomes necessary if every process in the system is running with P=1. The partial swapping
strategy is further illustrated in Example 3-2.


Example  3-2: 

Assume  that  two  processes  A  and  B  are  running  in  r.  system  with  120  memory  pages.  A  executes 
the  directive  MDA:  ALLOCATE  (3. 100)  else  (2,50)  else  (1.10)  and  B  executes  MDg  :  ALLO¬ 
CATED  1.25).  Assume  further  that  A  is  activated  first.  The  following  execution  time  intervals  (l,  ) 
are  observed: 

t1: A executes MD_A. The first request (3,100) is granted since X=100 is less than the available
free memory F=M-0=120. The status of A is S_A=100, P_A=3 (S is the resident set size);
the last argument of MD_A, (1,10), is saved in a process specific record. F=120-100=20 pages.

t2: Interrupt occurs and B is activated.

t3: B executes MD_B. The first request (P=1, X=25) cannot be granted because X=25 is larger than
F=20. Since P=1, OS invokes SM. The partial swapper (PS) finds A occupying 100 pages with a
priority larger than one, P=3. PS reduces S_A from 100 to 10 pages:
S_A: (P=3, X=100) -> (P=1, X=10). F=120-10=110 pages. Now B's request can be granted:
S_B=25 pages. F=110-25=85 pages.

t4: Interrupt occurs and A is activated.

t5: A executes MD_A. The first request (P=3, X=100) cannot be granted because X=100 > F=95. A
continues execution with its previous allocation S_A=10 until the next request (2,50) is
received. The request is granted since X=50 < F=85. The status of A is S_A=50 and P_A=2. F=45.

A steady state is reached with 25 pages allocated to B and 50 pages allocated to A. B always gets
the 25 pages it asks for since the request has a high priority, P=1. A cannot be allocated 100 pages
as long as B is in the system; A will not, however, be forced to run with 10 pages since B cannot
invoke SM anymore.
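
The bookkeeping of Example 3-2 can be traced with a short simulation. The following Python sketch
is only an illustration under stated assumptions: the names Process, Memory, available_to,
partial_swap and allocate are hypothetical, a request is compared against the free pages plus the
pages the requester already holds, and the partial swapper simply shrinks processes running with
P>1 to their saved (1,X) request until the pending request fits.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    directive: list          # [(P, X), ...] in the order written in ALLOCATE
    resident: int = 0        # S, resident set size in pages
    priority: int = 0        # P of the request currently granted


@dataclass
class Memory:
    total: int
    procs: list = field(default_factory=list)

    @property
    def free(self):
        return self.total - sum(p.resident for p in self.procs)

    def available_to(self, proc):
        # Assumption: a process may reuse the pages it already holds.
        return self.free + proc.resident

    def partial_swap(self, requester, pages):
        """Shrink processes running with P > 1 to their saved (1, X) request."""
        for p in self.procs:
            if self.available_to(requester) >= pages:
                break
            if p is not requester and p.priority > 1:
                p.priority, p.resident = p.directive[-1]

    def allocate(self, proc):
        """Try each (P, X) pair in turn; return the granted pair or None."""
        if not any(p is proc for p in self.procs):
            self.procs.append(proc)
        for prio, pages in proc.directive:
            if pages > self.available_to(proc) and prio == 1:
                self.partial_swap(proc, pages)      # lowest level request: invoke SM
            if pages <= self.available_to(proc):
                proc.priority, proc.resident = prio, pages
                return prio, pages
        return None


mem = Memory(total=120)
a = Process("A", [(3, 100), (2, 50), (1, 10)])
b = Process("B", [(1, 25)])
for proc in (a, b, a):                              # intervals t1, t3 and t5
    print(proc.name, mem.allocate(proc), "free =", mem.free)
```

Under these assumptions the trace ends with 50 pages held by A, 25 pages held by B and 45 free
pages, which is the steady state described above.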

3.4.1.  Primitives  of  ALLOCATE  directive 

ALLOCATE  incorporates  two  primitives:  the  priority  index  P  and  the  memory  request  X. 
Both  primitives  are  discussed  in  the  following  subsections. 

3.4.1.1.  Priority  primitive,  P

The hierarchical nature of memory demands, due to the hierarchical nature of locality structures,
is reflected in the hierarchical form of ALLOCATE through the priority primitive P. Recall
that the allocation of the highest level locality, level one, achieves the minimum page fault rate
during the execution of a multi-nested loop, while the allocation of the lowest level locality,
comprised by the innermost loop, is sufficient to prevent thrashing. The highest level memory
demand is determined by the highest level locality, comprised by the outermost loop of a
multi-nested loop structure, whereas the lowest level memory demand is given by the virtual size of
the lowest level locality. Memory requirements in between the outermost and innermost loops of a
program are defined by the sizes of the corresponding localities.

The priority primitive, P, is used to determine the sequence in which localities of a given
construct should be tried for allocation. A locality at level one should be tried for allocation
before a locality at level two, a locality at level two should be tried for allocation before a
locality at level three, and so forth. Such precedence is motivated by the hierarchy of locality
sizes. Higher level localities are larger in size than lower ones, and the allocation of larger
localities is sought to achieve lower fault rates.

The priority primitive is also used to impose a lower limit on the memory requirement of a
program. Such a lower limit is given by the virtual size of the lowest level locality. The inability
to allocate the lowest level locality results in thrashing. The execution of a directive associated
with a lowest level locality requires the allocation of such a locality even at the expense of
swapping some processes out of memory. Thus, P is used by OS to invoke SM when necessary. Moreover,
P is used by the partial swapping mechanism as discussed in the previous section. A process running
with P>1 might be selected for swapping by PS.

The value of P can be deduced from the relative position of a locality in a hierarchical locality
structure. The largest value of P is defined by the maximum nest depth (Λ) of a loop structure,
since Λ imposes an upper bound on the number of localities in a given locality structure. Hence,
the values of P range from 1 to Λ. The outermost and the innermost loops compose an envelope
enclosing all other intermediate localities. P=1 can, in principle, be assigned to either one and
P=Λ to the other. In the previous section we assigned, by convention, P=1 to the innermost loop.
The motive behind this is to enable OS, while processing a directive, to determine the memory
request associated with the lowest level locality. This is necessary for two reasons. First, if the
current memory request cannot be allocated and P=1, SM should be invoked. Otherwise (if P=Λ is
assigned to the lowest level locality), Λ would have to be known at the time of executing a
directive in order to compare it with P every time a request cannot be satisfied. Second, the value
of the (P,X) pair associated with the lowest level locality should be stored in order to partially
swap a process if needed.

In a multi-nested loop structure, there can be more than one innermost loop. Each of these
loops forms a lowest level locality which must be allocated if the process is to continue execution.
The priority P=1 is assigned to every innermost loop and P=Λ (the maximum nest depth) to the
outermost loop. The priority of any intermediate level takes a value between 2 and Λ-1. Such a
value is used to indicate how many more parameters a directive could have. In effect, the value of
P at any level L_i is a measure of the distance d between L_i and the innermost loop enclosed by
L_i. The priority P_i assigned to any loop L_i can be iteratively evaluated by finding the maximum
nest depth Λ_i of an inner loop enclosed by L_i, assuming that L_i is an outermost loop, and
assigning P_i=Λ_i.

Assigning priorities to a loop structure, thus, cannot be performed with a single top-down
parsing technique, since it is necessary to know the depth of the innermost loop relative to the
current outer one. A single top-down parsing scheme can be used, however, if P=1 is assigned to
the outermost loop. A procedure for assigning priorities to various loops in a multi-nested loop
construct is given in Algorithm 3-1.

Algorithm  3-1:  Assign  priorities  to  loop  structures; 

Repeat: 

Step  1:  Parse  until  a  loop  is  encountered. 

Step 2: Find the maximum nest depth, Λ, related to this loop.

Step 3: Assign P=Λ to the current loop.

Until  the  end  of  the  program  is  reached. 

An  example  using  Algorithm  3-1  is  shown  in  Figure  3-4. 
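
As a rough illustration of Algorithm 3-1, the sketch below represents a loop structure as a tree
and assigns to each loop the maximum nest depth measured from that loop inward, so that every
innermost loop receives P=1. The Loop class and the function name assign_priorities are
hypothetical and are not part of the compiler described in this study.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    label: str
    inner: list = field(default_factory=list)   # loops nested directly inside this one
    priority: int = 0


def assign_priorities(loop):
    """Set loop.priority to the maximum nest depth, counted from this loop inward."""
    depths_below = [assign_priorities(child) for child in loop.inner]
    loop.priority = 1 + max(depths_below, default=0)
    return loop.priority


# A loop 10 enclosing loop 20 (which encloses loop 30) and a separate innermost loop 40.
l30 = Loop("30")
l20 = Loop("20", [l30])
l40 = Loop("40")
l10 = Loop("10", [l20, l40])
assign_priorities(l10)
for lp in (l10, l20, l30, l40):
    print("loop", lp.label, "P =", lp.priority)
```

Both innermost loops (30 and 40) receive P=1 even though they sit at different depths, which is
why the depth below each loop must be known before its priority can be assigned.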

3.4.1.2.  Memory  request  primitive,  X

The memory requirement of a program, X, at a given time, is determined by the virtual size
of the current program locality under execution and is used as a primitive of ALLOCATE. In this
study the localities are restricted to those comprised by loop structures, since the study is
conducted on numerical programs where the locality structures can be correlated to loop structures
at the source level code [5], [30] and [31]. In this section, the virtual size of a program locality
is estimated using source level information.

Only references to array data structures are considered in this study. The instruction code
and data constants are assumed to be locked permanently in main memory. This assumption is
realistic since the paging behavior of numerical programs is dominated by references to array data
structures inside loops [4], [30]. Moreover, the virtual size of the instructions and the constants
is relatively small compared to the virtual size of array data structures.

Figure 3-4: Example of assigning priorities

The estimation of the virtual size of the current locality utilizes only the information available
at the source level code. A wide range of FORTRAN programs used in different packages was
examined for the purpose of identifying their localities, using the information inherent in
the source code. Some of these packages are UIARL: University of Illinois Atmospheric Research
Lab; ACM: ACM Standard Programs; IEEE: IEEE Standard Programs for signal processing; NRL:
Naval Research Laboratory; AFWL: Air Force Weapons Laboratory; Fishpak; Eispak; and Minpak.
Fishpak is a package of Fortran subprograms for the solution of separable elliptic partial
differential equations developed at NCAR (National Center for Atmospheric Research). Eispak is a
package of Fortran subroutines for the analysis of standard and generalized eigenvalue problems.
Minpak is a package of Fortran subroutines for finding the minimum of solution squares of sets of
nonlinear equations.

Examining the source code of these programs reveals that six parameters can be used to calculate
the virtual size of a locality. Five of these parameters are program dependent and one is system
dependent. The system dependent parameter is the page size (P). The page size is necessary for
calculating the virtual size of a locality in pages, since memory allocation is measured in pages.
The program dependent parameters are

(1) Array size (Z): Z is usually given as an (M x N) dimension, where M is the number of rows and
N is the number of columns. A vector is an array with N=1. Only up to two-dimensional
arrays are considered in this study. Array sizes are given explicitly in dimension declaration
statements. The virtual size of an array (S_A) is given by

$$ S_A = \frac{M \times N}{P} $$

assuming that each array element is one word long. The virtual sizes of all arrays referenced
in a program comprise an upper bound on its memory requirements. The memory requirements
during the execution of a loop structure are bounded by the virtual size of the arrays
referenced inside the structure.

(2) The nest depth of a loop structure (Λ): Λ determines whether the current locality has a
hierarchical structure or not. The value Λ>1 implies that a hierarchical locality structure
with at most Λ levels may exist. It is possible, however, not to have a hierarchical locality
structure with Λ>1. For example, a doubly nested loop (Λ=2) with arrays referenced in a
row major order inside the inner loop forms a single locality of level one. The nest depth is
also useful for assigning priority indexes to nested loops.

(3) The number of indexed variables used to reference the elements of an array (N): N is used to
give an upper bound on the number of distinct array pages referenced at a given locality level.
The maximum number of array elements which can be referenced during one iteration of a
loop is determined by the number of distinct indexes, N, used to address the array. If the
array elements referenced at a particular level are stored in distinct pages, then N distinct
pages are referenced at this level. Depending on the dimension and the order of reference of
an array, N can be used to give an upper bound on the number of array pages which participate
in the formation of the locality at the current level.

(4)  The  order  in  which  arrays  are  referenced  has  a  direct  effect  on  the  formation  of  a  locality.  If 
an  array  is  referenced  in  the  same  order  as  the  elements  are  stored  in  the  virtual  storage,  then 
each  array  referenced  inside  a  loop  contributes  to  the  locality  comprised  by  that  loop.  On  the 
other  hand,  if  the  elements  of  an  array  are  referenced  in  a  different  way  than  that  of  the 
storage  scheme,  then  references  fall  across  pages.  In  this  study  we  have  assumed  a  column 
major  order  scheme.  The  elements  of  the  first  column  are  stored  sequentially  in  the  same  page. 
If  the  number  of  elements  in  a  column  exceeds  the  number  of  words  in  a  page,  a  second  page 
is  used,  and  so  on  until  all  elements  of  a  column  are  stored.  Then  the  elements  of  the  second 
column  of  the  array  are  stored  in  the  same  manner,  until  all  columns  have  been  stored. 

An array is said to be referenced in a column major order, A^c, if the column index, J, is
fixed and the row index, I, varies during the execution of a loop, L_i. The addresses generated
by references to A^c fall into adjacent virtual space locations. Once a page is addressed, its
elements will be referenced sequentially until a second page is addressed. When a second page is
referenced, the first page will no longer be active. Therefore, only one page from A^c will be
active at a time. However, if several row indexes are used to reference a column's elements,
several pages might become active during the execution of L_i. Hence, the locality comprised
by L_i may consist of several pages, depending on the number of row indexes used in combination
with a column index. It is also possible that several column indexes, J, could be specified
at outer levels. In this case, the virtual space specified by each column will participate in the
formation of CL.

An array is referenced in a row major order, A^r, if the elements of a row are referenced
sequentially during the execution of a loop, L_i. The row index I is fixed, while J, the
column index, varies inside L_i. Any two elements referenced in a row major order are located in
two different columns and, hence, in two different pages, unless the column size is less than
the page size. Therefore, the number of pages to be referenced during the execution of L_i is
equal to the range of the column index J. A page referenced in one iteration of L_i would not
be referenced in the next iteration, since the next reference is made to an element in a different
column. Hence, no locality of reference exists at the L_i level. If the virtual size of a column,
S_M, is less than the page size, the elements of two successive columns may be stored in the same
page, assuming that all pages of an array are filled except, possibly, for the last page. In this
case the number of pages expected to be referenced during the execution of L_i is equal to the
total number of pages in the virtual space of A^r. The maximum number of pages referenced
from the virtual space of A^r depends on the size of each column compared to the page size.
Given an A^r with a dimension (M x N), where M is the number of elements in a column or the
range of the row index I, and N is the number of elements in a row or the range of the
column index J, the following equation finds the maximum number of pages, X_i, that may be
referenced during the execution of L_i:

$$ X_i = \begin{cases} \dfrac{M \times N}{P} & \text{if } S_M < P \\ N & \text{if } S_M \ge P \end{cases} \qquad (3\text{-}1) $$

where S_M is the virtual size of a column and M*N/P is the virtual size of the array. Note
that references to A^r do not form a locality of reference at the same level of reference (L_i).
However, a macroscopic view from a higher level L_j, where j=1,...,i-1, shows that a locality of
reference results from references to A^r at L_i.

(5) The level (or nest depth) at which an array is referenced (λ): λ=1 is the nest depth of the
outermost loop in a multi-nested loop structure. λ increases as we go deeper into the loop nest.
The nest depth of the innermost loop, λ=Λ, is the maximum nest depth of a loop structure.
The smaller the value of λ, the higher is the level. A row-wise referenced array at some level
λ=i does not form a locality at this level. However, if there exists a higher level λ<i, then
A^r forms a locality at all levels with λ<i. That is because the virtual space from A^r,
referenced at λ=i, is rereferenced repeatedly during the execution of any higher level loop with
λ<i. The entire virtual space of A^r, S_A, is referenced during each iteration of any loop with
level λ=1,2,...,i-2. Therefore, A^r tends to form a locality at higher levels λ<i with a size
X_{i-1} given by equation 3-1 for the locality at level λ=i-1 and a size X_j=S_A for
j=1,2,...,i-2.

Similarly, for the case of a vector, one iteration of a higher level loop λ=j is sufficient to
span the entire virtual space of all vectors referenced at lower levels, λ>j. Therefore, the
entire virtual space of a vector referenced at level λ=i, i≠1, contributes to all higher level
localities, λ<i.

In the case of a column-wise referenced array inside a loop at level λ=i, one or more columns
of an array are spanned during the execution of the L_i loop. These columns are usually specified
by an outer loop with level λ<i. The entire virtual space of A^c is spanned during one iteration
of a loop at level λ=1,2,...,i-2. Thus, the entire virtual space of a column-wise referenced
array contributes to localities formed at least two levels higher than the level at which
the array is referenced.
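
The array size formula of item (1) and equation 3-1 of item (4) can be written as two small
functions. This is a sketch under stated assumptions: the function names are hypothetical, each
element occupies one word, the column size is compared with the page size in words, and partial
pages are rounded up to a whole page (the equations in the text use a plain quotient).

```python
def array_virtual_size(m, n, page_size):
    """Virtual size S_A of an (M x N) array in pages, rounded up to whole pages."""
    return (m * n + page_size - 1) // page_size


def row_major_pages(m, n, page_size):
    """Equation 3-1: maximum pages referenced by a row-wise sweep of an (M x N) array."""
    if m < page_size:
        # Several columns share a page, so the whole array may be touched.
        return array_virtual_size(m, n, page_size)
    # Each column starts in a different page, so one page per column index.
    return n


print(array_virtual_size(1000, 100, 100))   # array of Example 3-6
print(row_major_pages(1000, 100, 100))      # columns larger than a page
print(row_major_pages(10, 100, 100))        # columns smaller than a page
```

For the (1000 x 100) array used later in Example 3-6 with P=100 words, a row-wise sweep would touch
one page per column, whereas for short columns the bound is the virtual size of the whole array.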


Next, the above parameters are used in a more quantitative manner to evaluate the contribution of
vectors and arrays to a locality structure. For the convenience of the analysis, the cases of
vectors and arrays are treated separately.


Vectors 


A vector (V) is actually a matrix (M x 1) with M rows and one column. Memory locations in
which the elements of a vector are stored constitute the vector's virtual space. The elements of a
vector are stored sequentially in a page until the page is filled, and then a second page is used,
and so on until all the elements of a vector are completely stored in the virtual storage. A page
contains only the elements of one vector (homogeneous storage). The virtual size of a vector (S_v)
is defined as

$$ S_v = M / P $$

where P is the page size and M is the number of elements in the vector, assuming that each element
is one word long.

Assume that a vector, V_i, is referenced inside a loop at a nest depth λ=i, as shown in Figure
3-5, and V_i's elements are referenced through the indexed variables specified by the current loop,
L_i. The locality comprised by loop L_i is the current locality (CL). A vector V_i contributes to
the CL as well as to all higher level localities comprised by outer loops, L_1, L_2, ..., L_{i-1},
with nest depth λ=1,2,...,i-1, respectively.


Consider the contribution of V_i to the locality at level i-1. Assume that L_{i-1} has a range of
N_{i-1} iterations. Each iteration of L_{i-1} involves a full execution of L_i. The virtual space
of V_i is spanned totally in the time duration of L_i. Therefore, the virtual space of V_i will be
spanned N_{i-1} times (the range of L_{i-1}). Similarly, at level λ=i-2 with N_{i-2} iterations,
the virtual space of V_i will be spanned N_{i-2} x N_{i-1} times, thus forming a locality at level
λ=i-2. The same analysis applies to all higher level localities. Therefore, a macroscopic view of
the virtual space of V_i from any level λ=1,2,...,i-1 shows that the virtual space of V_i is being
referenced repeatedly.

Figure 3-5: Loop structure example

In general, a vector V_i referenced at the current locality level, L_i, contributes to localities
at higher levels L_j, j=1,2,...,i-1, with its entire virtual size, S_v. For K vectors referenced
at CL, each vector contributes to all higher level localities with its virtual size S_v. Let X_i
be the size of the current locality and X_j, where j<i, be the size of the higher level locality
with nest depth λ=j. Using these notations, the contribution of all vectors referenced at CL to
all higher level localities is calculated as follows:

$$ X_j = X_j + \sum_{k=1}^{K} S_{v_k} \qquad (3\text{-}2) $$

where j=1,2,...,i-1 and S_{v_k} is the virtual size of the k-th vector. K is the number of
different vectors referenced inside the CL. For illustration consider Example 3-3.

Example  3-3: 

      DIMENSION V1(1000), V2(1000)
      DO 10 I=1,1000
      DO 20 J=1,1000
      V1(J)=V1(J)+V2(I)
   20 CONTINUE
   10 CONTINUE

Assume that the page size is P=100 words and each vector element is one word. The virtual size of
V1 and V2 is 1000/100 = 10 pages each. The code in Example 3-3 adds the sum of the elements of
V2 to each element of V1. Each element from the virtual space of V1 (V1(1), V1(2), ..., V1(1000))
is added to one element from V2 (V2(I)) during the execution of Loop 20. All the elements of V1
are referenced during the execution of Loop 20, while only one element of V2 is referenced. In
other words, the entire virtual space of V1 (10 pages) will be referenced by the time Loop 20
completes 1000 iterations. These 10 pages will be referenced again when Loop 10 executes
another iteration. By the time Loop 10 iterates 1000 times, the virtual space of V1 will have been
spanned 1000 times. Hence, V1 contributes to the locality comprised by Loop 10 with S_v1=10
pages. Therefore, if the first level locality comprised by Loop 10 is to be considered for
allocation, then at least 10 pages must be allocated in order to avoid replacing V1's pages during
the execution of Loop 20. The contribution of V1 and V2 to the locality comprised by Loop 20 is
discussed next.

The contribution of V_i to the virtual size of CL is determined by the number of distinct vector
elements referenced during one iteration of the current loop. The number of distinct elements
referenced at level L_i is determined by N, the number of distinct indexed variables used to
reference vector elements. The distinct elements of a vector referenced by N indexes can be stored
in at most N pages, depending on the virtual size of a vector and the distribution of the N indexes
over the vector elements. In Example 3-3, one index, J, is used to reference V1 elements inside
Loop 20 and I is used to reference elements in the virtual space of V2. There are only two elements
(V1(J) and V2(I)) referenced during each iteration of Loop 20. Consider the first iteration of
Loop 10, I=1, and the execution of Loop 20 (J=1,1000). A reference made to V2(1) is translated to
a reference to the virtual address where the first element of V2 is stored. In effect, a reference
is made to the first page in the virtual space of V2, P_1(V2). The first page, P_1(V2), which
contains the first 100 elements of V2, remains active during the execution of Loop 20 (1000
iterations), since the index I varies only at the level of Loop 10. A reference made to V1(J) is
translated to the virtual address of the page containing the element V1(J), depending on the value
of J. For example, the first 100 references are made to the first page P_1(V1). The next 100
references (100<J≤200) are made to P_2(V1), and so on until P_10(V1) is referenced. Note that when
a new page is referenced, the old one will no longer be referenced until Loop 20 is reinitiated by
Loop 10. Therefore, during the execution of Loop 20, V1 needs only one page to be allocated in
memory and so does V2. Any extra allocation is redundant.
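
The paging behavior argued for Example 3-3 can be checked with a small least recently used (LRU)
simulation. The sketch below is illustrative only: the function names are hypothetical, the outer
loop is cut to a few iterations to keep the run short, and each inner iteration is modeled as one
reference to V1(J) followed by one reference to V2(I).

```python
from collections import OrderedDict

PAGE = 100      # words per page, as assumed in Example 3-3
SIZE = 1000     # elements per vector


def reference_string(outer_iterations):
    """Yield (vector, page) pairs generated by V1(J)=V1(J)+V2(I)."""
    for i in range(1, outer_iterations + 1):
        for j in range(1, SIZE + 1):
            yield ("V1", (j - 1) // PAGE)
            yield ("V2", (i - 1) // PAGE)


def lru_faults(refs, frames):
    """Count page faults of an LRU policy with a fixed number of page frames."""
    resident = OrderedDict()            # least recently used page comes first
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)
        else:
            faults += 1
            if len(resident) == frames:
                resident.popitem(last=False)
            resident[page] = True
    return faults


for frames in (2, 11):
    print(frames, "frames:", lru_faults(reference_string(5), frames), "faults")
```

With two frames (one page per vector) the only faults inside Loop 20 are those caused by moving to
the next page of V1, but all ten pages of V1 are faulted again on every iteration of Loop 10. With
eleven frames (the ten pages of V1 plus the active page of V2) only the initial loading faults
remain, which is the effect sought by allocating the first level locality.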

In general, if the virtual size of a vector is less than the page size, the vector contributes with
one page to X_i. However, if the virtual size of a vector is larger than the page size, S_v>1, the
number of distinct vector elements referenced at CL comprises an upper bound on the number of
distinct pages that could be referenced at CL. The number of distinct vector elements referenced at
CL is determined by the number of distinct indexes, N, used to reference a vector.

Figure 3-6: A vector's memory representation

Figure 3-6 shows a memory representation scheme of a vector. The indexes I_1, I_2, ..., I_N are
used to reference distinct elements in the form V(I_1), V(I_2), ..., V(I_N). The number of distinct
indexed variables, N, is used to determine the maximum number of pages that might become active
during the execution of CL. Such active pages constitute the body of CL, and X is the virtual size
of the locality. Therefore, V_i contributes to the current locality size with the number of
distinct indexes, N pages, or the vector's virtual size, whichever is less. Consequently, the
memory request primitive, X_i, is given by

$$ X_i = X_i + \begin{cases} N + 1 & \text{if } N < S_v \\ S_v & \text{if } N \ge S_v \end{cases} \qquad (3\text{-}3) $$


A vector is allocated N+1 pages, although the active set of pages contains only N pages. The extra
page is used as a buffer to allocate a newly referenced page after N pages have already been
referenced. Buffering the new page avoids immediately replacing one of the active N pages. Since
the locality of a program contains only N pages, one of the allocated N+1 pages will be idle and,
hence, will be a candidate for replacement if a new page is referenced. The underlying assumption
here is that a least recently used (LRU) or a similar replacement policy is used.

In general, if there are K vectors referenced inside a loop L_i, then the memory requested to
allocate the K vectors is given by

$$ X_i = X_i + \sum_{k=1}^{K} X_{v_k} \qquad (3\text{-}4) $$

Example 3-5:

      DIMENSION V1(1000), V2(1000)
      DO 10 J=1,1000
      DO 20 I=1,1000
      V1(I*2); V1(I)
      V2(I); V2(I+1); V2(J);
   20 CONTINUE
   10 CONTINUE

In Example 3-5, each vector has 10 pages in its virtual space. At the first level (Loop 10) both
vectors need to be allocated entirely since they are referenced at the lower level locality (Loop
20). Hence, the memory allocation primitive at the first level is X=20 pages. The directive
inserted at the beginning of Loop 10 would be of the form ALLOCATE (2,20). At the second level,
there are two indexes, I and I*2, used to reference two elements in the virtual space of V1;
hence, N=2 and at most two memory pages from the virtual space of V1 are active during the
execution of Loop 20. Since N=2 is less than S_v1=10, the memory requested to allocate V1 is
X_v1=2+1=3 pages. Note that if only two pages were allocated to V1, a page would be replaced every
50 iterations and then faulted during the next iteration of the loop. Such extra faults are avoided
by using the extra buffering page.

Three indexes, I, I+1 and J, are used to reference three elements in the virtual space of V2. The
three referenced elements could be stored in at most three pages, N=3. Since N<S_v2, the memory
required to allocate V2 is X_v2=3+1=4 pages. The total memory space required at the second level
is X=3+4=7 pages, and the directive at this level has the form ALLOCATE (2,20) else (1,7). Note
how ALLOCATE prefers the allocation of 20 pages (the entire virtual space of V1 and V2) over 7.
However, if 20 pages cannot be allocated, 7 pages are enough to avoid thrashing while Loop 20 is in
control of the CPU.
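
The arithmetic of Example 3-5 follows directly from equations 3-2, 3-3 and 3-4. The sketch below
reproduces it; the function names and the representation of a vector as an (N, S_v) pair are
assumptions made for illustration only.

```python
def vector_request(n_indexes, size_pages):
    """Equation 3-3: N active pages plus one buffer page, capped by the vector size."""
    return n_indexes + 1 if n_indexes < size_pages else size_pages


def current_level_request(vectors):
    """Equation 3-4: sum of the per-vector requests; vectors = [(N, S_v), ...]."""
    return sum(vector_request(n, s) for n, s in vectors)


def higher_level_request(vectors):
    """Equation 3-2: each vector contributes its entire virtual size."""
    return sum(s for _, s in vectors)


def allocate_directive(requests):
    """Format [(P, X), ...] as an ALLOCATE directive, highest priority value first."""
    return "ALLOCATE " + " else ".join(f"({p},{x})" for p, x in requests)


# Example 3-5: V1 is referenced through 2 indexes, V2 through 3; both span 10 pages.
vectors = [(2, 10), (3, 10)]
x_outer = higher_level_request(vectors)     # locality of Loop 10
x_inner = current_level_request(vectors)    # locality of Loop 20
print(allocate_directive([(2, x_outer), (1, x_inner)]))
```

Under these assumptions the sketch yields the same directive as the text, ALLOCATE (2,20) else
(1,7).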

Equations 3-3 and 3-4 are incorporated into a data structure constructed at compile time to
estimate the memory requirements of a program. The construction method of such a data structure
is discussed later in this section.

Two-dimensional  arrays

Depending on their referencing order, arrays can be referenced in a column major order
(column wise referenced arrays, A^c) or in a row major order (row wise referenced arrays, A^r).
Both types are discussed in the following subsections.

Column  wise  referenced  arrays 

Figure 3-7: Column wise referenced arrays (a: Column Size ≤ Page Size)

Consider, in Figure 3-4, the column wise referenced array A^c(i,j) at level i. The column
index of A^c, J, remains unchanged during the execution of L_i; J is specified at higher levels
L_1, L_2, ..., L_{i-1}. The row index, I, changes its value at the L_i level. Array elements are
referenced in the form A(I,J). During the execution of L_i, elements stored in the virtual space of
a column J are addressed using one or more row indexes, I_1, I_2, .... Array elements A(I_1,J),
A(I_2,J), ..., A(I_N,J) could be stored in one page if the column size (S_c) is equal to or less
than the page size (S_c ≤ P) (Figure 3-7a) or in several pages if S_c > P (Figure 3-7b).

In the first case, no matter how many row indexes are used to designate a particular element,
only one page could be referenced during the execution of L_i. In the second case, several pages in
the virtual space of a column could be referenced during one iteration of L_i. Consequently, the
number of row indexes, N_I, used in combination with a particular column index J determines the
number of active pages from the virtual space of A^c in the time duration of L_i. Obviously, if the
number of pages present in the virtual space of a column is less than the number of row indexes,
i.e., S_c < N_I, the entire virtual space of a column is active.
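
The two cases above amount to a simple bound on the number of active pages of a column wise
referenced array. The following sketch states it as a function; the name column_active_pages is
hypothetical, and rounding the column size up to whole pages is an assumption.

```python
def column_active_pages(column_words, n_row_indexes, page_size):
    """Upper bound on the pages of one column that are active during L_i."""
    if column_words <= page_size:
        return 1                                    # the whole column fits in one page
    column_pages = -(-column_words // page_size)    # ceiling division
    return min(n_row_indexes, column_pages)


print(column_active_pages(1000, 2, 100))   # long column, two row indexes
print(column_active_pages(50, 3, 100))     # column shorter than a page
print(column_active_pages(250, 5, 100))    # more row indexes than column pages
```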

Consider Example 3-6, where array A is referenced in a column major order inside the innermost
loop of a doubly nested loop. The virtual size of A is S_A = 1000 × 100/100 = 1000 pages, where the
page size is P = 100 words. The virtual size of each column is S_c = 1000/100 = 10 pages. The
sequence of addresses generated during the execution of Loop 20 is shown in Figure 3-8 for J = 1.
A reference to A(I,1) is translated into a reference to P1 for 1 ≤ I < 100, P2 for 100 ≤ I < 200, ...,
and to P5 for 400 ≤ I < 500; i.e., a new page is referenced every 100 iterations of Loop 20.
Similarly, references to A(I*2,1) generate references to a new page every 50 iterations of Loop 20;
i.e., P1 for 1 ≤ I < 50, P2 for 50 ≤ I < 100, P3 for 100 ≤ I < 150, ..., and P10 for 450 ≤ I < 500.

Example 3-6:

DIMENSION A(1000,100)
DO 10 J = 1,100
DO 20 I = 1,500
A(I,J); A(I*2,J);
20 CONTINUE
10 CONTINUE

Figure 3-8: Virtual address sequence for A

Figure 3-8 shows that two pages remain active in the time duration of Loop 20, except for the time
interval I < 50. These pages are determined by the indexes I and I*2. In principle, both elements
designated by I and I*2 could be stored in the same page, in which case the same page is referenced
twice, or in two different pages, in which case two distinct pages are active. In effect, the number
of row indexes used in combination with a particular column, J, gives an upper bound on the number
of pages that could be active during one iteration of L_i (Loop 20 in our example). The set of active
pages and their time intervals, derived from Figure 3-8, is

{(P1: I < 50), (P1,P2: 50 ≤ I < 100), ..., (P5,P10: 450 ≤ I < 500)}
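The page trace of Example 3-6 can be reproduced with a short simulation. The following Python sketch is only an illustration and is not part of the original study: it assumes 1-based indexes, column major storage starting at a page boundary, and hypothetical helper names (`page_of`, `active_pages`). Under these assumptions the exact interval boundaries may differ by one iteration from the rounded intervals quoted above.

```python
PAGE_SIZE = 100        # words per page (P in Example 3-6)
COLUMN_LENGTH = 1000   # first dimension of A(1000,100)

def page_of(row, column=1):
    """Return the 1-based page holding A(row, column) in column-major storage."""
    offset = (column - 1) * COLUMN_LENGTH + (row - 1)  # 0-based word offset
    return offset // PAGE_SIZE + 1

def active_pages(i, column=1):
    """Distinct pages touched by A(I,J) and A(I*2,J) in one iteration of Loop 20."""
    return sorted({page_of(i, column), page_of(2 * i, column)})

# Largest number of simultaneously active pages over the whole of Loop 20.
max_active = max(len(active_pages(i)) for i in range(1, 501))
```

For instance, `active_pages(60)` yields pages 1 and 2, and `active_pages(480)` yields pages 5 and 10, which agrees with the set of active pages listed above; `max_active` never exceeds the number of row indexes, two.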

Naturally, more than one column of A^c could be referenced inside L_i. In this case, the
number of active pages is defined for each column, J_j, by finding the number of row indexes used
in combination with J_j. The maximum number of active pages from the virtual space of an array is
found by summing up the numbers found for each column. If the total number of active pages
determined in this manner exceeds the number of pages present in the virtual space of an array, the
virtual size of the array defines the set of active pages at the current execution level.




The referencing behavior of A^c resembles that of a vector. In fact, an (M×N) array referenced
in a column major order can be viewed as a set of N vectors, each vector containing M elements.
The memory required to allocate the active pages of a column, X_c, is given by

    X_c = \begin{cases} N_j + 1 & \text{if } N_j < S_c \\ S_c & \text{if } N_j \ge S_c \end{cases}        (3-5)


The extra page in "N_j + 1" is used to avoid replacing active pages when a new page is activated, as
discussed earlier. The memory requirement of a column wise referenced array, X_{A^c}, is defined as
the sum of the memory requirements defined for each column, or

    X_{A^c} = \sum_{j=0}^{N} X_{c_j}        (3-6)

where N is the number of columns addressed at level L_i; j = 0 means that no array is referenced in
a column major order. In general, if there are K arrays referenced in a column major order, the
memory required to allocate these arrays at the current level of execution, L_i, is given by

    X_i = X_i + \sum_{k=0}^{K} X_{A_k^c}        (3-7)

where X_{A_k^c} is the memory requirement of the k-th column wise referenced array.
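Equations (3-5) and (3-6) translate directly into two small functions. The sketch below is illustrative only; the function names and the representation of a column as an (N_j, S_c) pair are assumptions made for this example and do not come from the original text.

```python
def column_requirement(n_row_indexes, column_pages):
    """Equation (3-5): pages requested for one referenced column.

    n_row_indexes is N_j, column_pages is S_c.
    """
    if n_row_indexes < column_pages:
        return n_row_indexes + 1      # one extra page so active pages are not replaced
    return column_pages               # the whole column is active

def array_requirement(columns):
    """Equation (3-6): sum of the column requirements.

    columns is a list of (N_j, S_c) pairs, one per referenced column.
    """
    return sum(column_requirement(n, s) for n, s in columns)
```

With a 10-page column and two row indexes, `column_requirement(2, 10)` gives 3 pages; with a 1-page column the result is capped at 1 page regardless of the number of row indexes.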


Next we evaluate the contribution of a column wise referenced array to higher level localities.
A column wise referenced array contributes to all higher level localities of levels L_j, where
j = 1, 2, ..., i-2, with its entire virtual size. The contribution of A^c to the next higher level
locality, L_{i-1}, is similar to its contribution to CL, comprised by L_i, because the virtual space
of A^c is referenced only once during the execution of L_{i-1}. At the higher levels L_1, L_2, ...,
L_{i-2}, on the other hand, the virtual space of A^c is entirely referenced at least once during each
iteration of any L_j loop, where j = 1, 2, ..., i-2. The memory request primitive at higher levels,
defined by column wise referenced arrays, is given by

    X_j = X_j + \sum_{k=1}^{K} S_{A_k}        (3-8)


where j = 1, 2, ..., i-2 and K is the number of different arrays referenced inside L_i. S_{A_k} is
the virtual size of the k-th array.

Example 3-7 is used to further explain the process of calculating the virtual size of a locality
comprised by arrays referenced in a column major order. Two arrays (A1 and A2) are referenced
in a column major order inside the innermost loop of a triply nested loop.

Example 3-7:

Dimension A1(100,100), A2(400,100)
DO 10 K = 1, 10
DO 100 J = 1, 100
DO 1000 I = 1, 400
A1(I,J); A1(I+1,J); A1(I,J+2); A1(I+1,J+2);
A2(I,J); A2(I*2,J); A2(I,J+5); A2(I+2,J+5); A2(M-I,J+5);
1000 CONTINUE
100 CONTINUE
10 CONTINUE


A1 and A2 are referenced inside Loop 1000. The contribution of A1 and A2 to the localities defined
at level one (Loop 10), level two (Loop 100), and level three (Loop 1000) is evaluated using
Equations (3-5) through (3-8). With a page size of P = 100 words, the virtual sizes of A1 and A2 are
given by

    S_A1 = (100 × 100)/100 = 100 pages and S_A2 = (400 × 100)/100 = 400 pages

The virtual size of each column of A1 is S_c1 = 100/100 = 1 page, and each column of array A2 is
stored in S_c2 = 400/100 = 4 pages. The memory requirements of A1 and A2 at the Loop 1000 level are
found as follows. For A1, two columns are referenced inside Loop 1000. Since each column has
only one page in its virtual space, there could be only one active page in the virtual space of each
of J and J+2 during the execution of Loop 1000. Therefore, the memory requested to allocate A1 is
equal to the number of referenced columns, or X_A1 = 2 pages.

For A2, each column occupies 4 pages, and there are 2 columns referenced inside Loop 1000
(J and J+5). Two elements in the virtual space of column J are designated by the row indexes I and
I*2 (N_J = 2). The memory required to allocate both active pages of column J of A2 is given by
Equation (3-5): X_J = N_J + 1 = 3 pages. Three elements are referenced from the virtual space of
column J+5. These elements are specified by the row indexes I, I+2, and M-I. The maximum number of
active pages from the virtual space of J+5 is given by N_{J+5} = 3. Hence, the memory space required
to allocate these active pages is X_{J+5} = 3 + 1 = 4 pages. The total number of pages required to
allocate A2 at the lowest level (Loop 1000) is X_A2 = 3 + 4 = 7 pages. Finally, the memory
requirement of A1 and A2 at the lowest level (Loop 1000) is X_3 = 7 + 2 = 9 pages, where X_3 is the
memory request primitive of ALLOCATE associated with Loop 1000.

When Loop 100 reiterates, a new set of columns from the virtual space of A1 and A2 is
specified and the addressed virtual space will change accordingly. However, the amount of memory
required at this level does not change. Therefore, the value of X at this level is also 9 pages
(X_2 = 9). At the first level (Loop 10) the locality size consists of the entire virtual sizes of A1
and A2. During each iteration of the first level locality (Loop 10) the virtual spaces of A1
(100 pages) and A2 (400 pages) are totally referenced. At this outer level, all 500 pages will have
been referenced 10 times by the time Loop 10 completes execution. Therefore, the memory requirement
at level one is given by X_1 = 100 + 400 = 500 pages.
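The arithmetic of Example 3-7 can be checked mechanically. The Python sketch below is a hedged illustration, not the method used in this study: the dictionary layout, the string form of the index expressions, and the helper names are assumptions introduced here. It counts the distinct row indexes per referenced column, applies Equation (3-5) per column, and uses the entire virtual sizes for the outermost level as in Equation (3-8).

```python
PAGE = 100  # words per page, as assumed in Example 3-7

# (row index, column index) pairs as they appear inside Loop 1000.
arrays = {
    "A1": {"dims": (100, 100),
           "refs": [("I", "J"), ("I+1", "J"), ("I", "J+2"), ("I+1", "J+2")]},
    "A2": {"dims": (400, 100),
           "refs": [("I", "J"), ("I*2", "J"), ("I", "J+5"),
                    ("I+2", "J+5"), ("M-I", "J+5")]},
}

def lowest_level_requirement(info):
    """Sum Equation (3-5) over every referenced column of one array."""
    rows, _ = info["dims"]
    column_pages = -(-rows // PAGE)          # ceil(rows / PAGE), i.e. S_c
    row_indexes = {}                         # column index -> distinct row indexes
    for row, col in info["refs"]:
        row_indexes.setdefault(col, set()).add(row)
    total = 0
    for indexes in row_indexes.values():
        n = len(indexes)                     # N_j
        total += n + 1 if n < column_pages else column_pages
    return total

def virtual_size(info):
    """Virtual size of the whole array in pages (S_A)."""
    rows, cols = info["dims"]
    return -(-(rows * cols) // PAGE)

x3 = sum(lowest_level_requirement(a) for a in arrays.values())  # Loop 1000
x2 = x3                                                         # Loop 100
x1 = sum(virtual_size(a) for a in arrays.values())              # Loop 10
```

Under these assumptions the sketch gives x3 = 9, x2 = 9 and x1 = 500 pages, the same values derived in the text.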

Finally, ALLOCATE directives with both primitives, priority and memory request, are
inserted in the code of Example 3-7:

Dimension A1(100,100), A2(400,100)
ALLOCATE (3,500)
DO 10 K = 1, 10
ALLOCATE (3,500) else (2,9)
DO 100 J = 1, 100
ALLOCATE (3,500) else (2,9) else (1,9)
DO 1000 I = 1, 400
A1(I,J); A1(I+1,J); A1(I,J+2); A1(I+1,J+2);
A2(I,J); A2(I*2,J); A2(I,J+5); A2(I+2,J+5); A2(M-I,J+5);
1000 CONTINUE
100 CONTINUE
10 CONTINUE
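The nested form of the directives above, in which each level repeats the (priority, memory request) pairs of all enclosing levels, can be generated from a list of pairs ordered from the outermost to the innermost loop. The following Python sketch only illustrates that textual pattern; the function name and the list representation are assumptions, and it does not describe how the directive is processed at run time.

```python
def allocate_directives(primitives):
    """Build one ALLOCATE line per loop level.

    primitives is a list of (priority, pages) pairs ordered from the
    outermost loop to the innermost loop.
    """
    lines = []
    for depth in range(1, len(primitives) + 1):
        clauses = ["(%d,%d)" % pair for pair in primitives[:depth]]
        lines.append("ALLOCATE " + " else ".join(clauses))
    return lines

directives = allocate_directives([(3, 500), (2, 9), (1, 9)])
```

For the primitives of Example 3-7, the last entry of `directives` is `ALLOCATE (3,500) else (2,9) else (1,9)`, matching the directive placed before Loop 1000.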

Row  wise  referenced  arrays 

A memory representation scheme of a row wise referenced array, A^r, is shown in Figure 3-9.
A row index I_i (i = 1, ..., m) of A^r remains unchanged during the execution of the CL in which A^r
is referenced, whereas the column index J changes its value within the range J_1 to J_N, where N is
the number of columns (the second dimension of the array). In Figure 3-9, the arrows point in the
direction in which the elements are referenced. With I_i being fixed, the elements
A(I_i, J_1), ..., A(I_i, J_N) are stored across pages. If the column size, specified by the first
dimension M of the array, is larger than a page size (M > P), then any two elements referenced in a
row major order are fetched from two different pages. Two successively referenced elements may be
stored in the same virtual page if the virtual size of a column is less than P (M < P).

Figure 3-9: Row wise referenced arrays

The contribution of A^r to the CL comprised by L_i in Figure 3-5 is determined by the maximum
number of pages repeatedly referenced during the execution of L_i. Assume that the elements of the
first row, I = 1, are referenced during the execution time of L_i. A reference to the element A(1,1)
is translated into the address of P_1. The next element, A(1,2), will be referenced during the next
iteration of L_i, assuming that J is incremented by 1. The page containing A(1,2) is P_{m+1}
(Figure 3-9). Every next iteration generates an address to a new page in the virtual space of A^r.
The referencing pattern at the L_i level does not seem to comprise a locality of reference. A
referenced page, P_i, may not be referenced more than once in the time duration of L_i, unless the
same element is referenced more than once at the same level. The fast changing index, J, spans those
elements stored in the virtual spaces of columns J, where J = 1, 2, ..., N. Therefore, N distinct
pages are expected to be referenced in the time duration of L_i. None of the N pages remains active
during the execution of L_i. Such behavior is unfavorable in a virtual memory system, since every
iteration of L_i requires a reference to the virtual storage to fetch a new page. A newly fetched
page proves to be useful, most of the time, only for that reference. Therefore, if A^r is referenced
at the outermost level, then it makes no difference if the entire virtual space of the array is
allocated or only one page is allocated in main memory.
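The stride effect described above is easy to see numerically. The sketch below is illustrative only and assumes 1-based indexes, column major storage starting on a page boundary, and hypothetical function names; it lists the page touched by each element of one row.

```python
def page_of(row, column, column_length, page_size):
    """1-based page holding A(row, column) under column-major storage."""
    offset = (column - 1) * column_length + (row - 1)  # 0-based word offset
    return offset // page_size + 1

def pages_touched_by_row(row, n_columns, column_length, page_size):
    """Pages referenced when the column index sweeps 1..n_columns for a fixed row."""
    return [page_of(row, j, column_length, page_size)
            for j in range(1, n_columns + 1)]

# M = 1000 > P = 100: every reference along row 1 falls in a different page.
wide = pages_touched_by_row(1, 10, 1000, 100)
# M = 50 < P = 100: two successive elements of row 1 can share a page.
narrow = pages_touched_by_row(1, 4, 50, 100)
```

With M = 1000 and P = 100 each column occupies m = 10 pages, so A(1,2) falls in page 11, i.e. P_{m+1}, and the ten references touch ten distinct pages. With M = 50 the first two elements of the row share page 1.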

However, if A^r is referenced at a level λ > 1, then a locality of reference is observed at higher
levels. The set of pages referenced during the execution of L_i could be referenced again during the
next iteration of L_{i-1} if the row index is varied at this level. In this case the same set of
pages remains active until the value of I exceeds the page size limit. At any rate, the number of
active pages observed at the L_{i-1} level is given by the range of the column index J at the lower
level L_i. It has been assumed earlier that the range of J is N. In this case, the number of active
pages at the L_{i-1} level is N. So far we have considered that only one row index is being used in
combination with the column indexes to reference elements in the virtual space of A^r. However,
several row indexes could be used in any order. In such a case, there could be several rows of pages
active during the execution of L_{i-1}, depending on the relative location of one row index to
another. For each row index, I, the memory requirement is given by

    X_I = \begin{cases} R(J) & \text{if } R(J) < S_A \\ S_A & \text{if } R(J) \ge S_A \end{cases}        (3-9)

where R(J) is the range of J and S_A is the virtual size of A. In general, if several row indexes are
used, the memory required to allocate A^r is given by

    X_{A^r} = \sum_{I=1}^{R} X_I        (3-10)

where R is the number of row indexes used at level L_i. For K arrays referenced at the L_i level,
with A_k^r denoting the k-th array referenced in a row major order, the memory requirement X_i is
given by

    X_i = X_i + \sum_{k=1}^{K} X_{A_k^r}  and  X_{i-1} = X_i        (3-11)

All the elements of A^r will get to be referenced in the time duration of L_{i-1}, which includes
multiple executions of L_i. By this time, all pages in the virtual space of A^r will have been
referenced. However, only N pages or several sets of N pages, according to Equations (3-9) and
(3-10), remain resident in memory during this time, where N is the range of the column index of A^r.
If L_{i-1} is enclosed by a loop at a higher level L_{i-2}, then all pages referenced at L_{i-1} will
be referenced again during the next iteration of L_{i-2}. Therefore, the entire virtual space of A^r
contributes to the locality size at L_{i-2} and at all higher levels L_{i-3}, ..., L_2, L_1. The
contribution of K arrays referenced in a row major order to X_j at level L_j, where
j = 1, 2, ..., i-2, is given by

    X_j = X_j + \sum_{k=1}^{K} S_{A_k^r}        (3-12)

where A_k^r is the k-th row wise referenced array at level L_i. For illustration, consider Example
3-8, where two arrays A_1 and A_2 are referenced in a row major order inside Loop 1000. The nest
depth of Loop 1000 is λ = 3. The virtual size of A_1 is S_A1 = 1000 × 10/100 = 100 pages, and
S_A2 = 200 × 100/100 = 200 pages. The virtual size of each column of A_1 is S_c1 = 1000/100 = 10
pages, and S_c2 = 200/100 = 2 pages. Memory representation schemes of A_1 and A_2 are also shown in
Example 3-8. The virtual space of A_1 is organized into 10 rows, each of which contains 10 pages. The
virtual space of A_2 has two rows, each of which contains 100 pages. Consider the execution
sequence K = 1, I = 1, and observe the reference pattern during the execution of Loop 1000,
J = 1,100. References to A_1 are translated into addresses to the virtual space in which the elements
of row I = 1 and row I = 999 are stored. This virtual space consists of the first and the last rows
of pages. During the next iteration of Loop 100, I = 2, the same set of pages will be referenced
again. References will continue to fall into these pages until I > 100, where the second row and the
next-to-last row will be referenced. And so, at every 100 iterations of Loop 100, a new set of 20
pages is referenced. Therefore, the maximum memory requirement of A_1 at this level is 20 pages. Or,
as given by Equations (3-9) and (3-10), the memory requirement of A_1 at the second level is
X_{A_1} = 10 + 10 = 20 pages, where the range of the column index is 10, as given in the dimension
statement.

For A_2 there is only one row index, I, used at the third level (Loop 1000). Hence, the number
of active pages consists of one row (100 pages) from the virtual space of A_2. Each row will remain
active for half of the time duration of Loop 100. Therefore, the memory requested to allocate A_2 at
level 2 is X_{A_2} = 100 pages. Considering both arrays, X_2 = 100 + 20 = 120 pages. The execution
of Loop 100 touches completely the virtual spaces of A_1 and A_2; i.e., all 300 pages are referenced
at least once in the time duration of Loop 100.
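The figures of Example 3-8 follow from Equations (3-9) and (3-10). The Python sketch below is a hedged illustration: the function names are hypothetical, Equation (3-9) is used in the form min(R(J), S_A), and the numbers of row indexes (two for A_1, one for A_2) are taken from the discussion above rather than from the example code, which is not reproduced here.

```python
def row_index_requirement(column_range, array_pages):
    """Equation (3-9): pages requested for one row index.

    column_range is R(J), array_pages is S_A.
    """
    return min(column_range, array_pages)

def row_wise_requirement(n_row_indexes, column_range, array_pages):
    """Equation (3-10): sum of Equation (3-9) over the row indexes in use."""
    return sum(row_index_requirement(column_range, array_pages)
               for _ in range(n_row_indexes))

x_a1 = row_wise_requirement(2, 10, 100)    # two row indexes, R(J) = 10, S_A1 = 100
x_a2 = row_wise_requirement(1, 100, 200)   # one row index, R(J) = 100, S_A2 = 200
x2 = x_a1 + x_a2                           # request at the Loop 100 level
x1 = 100 + 200                             # entire virtual sizes at the Loop 10 level
```

The sketch gives 20 pages for A_1, 100 pages for A_2, and therefore 120 pages at level two and 300 pages at level one.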


Now consider the case when Loop 10 continues its execution and K is incremented by 1,
K = 2. Ignoring the details of reference patterns at Loops 100 and 1000, the virtual spaces of A_1
and A_2 are completely touched once more. This process continues until Loop 10 completes
execution. Observing the virtual space of A_1 and A_2 from the first level, the locality of reference
seems to cover all 300 pages of A_1 and A_2. The memory requirement at this level is given by
X_1 = 200 + 100 = 300 pages. ALLOCATE directives are inserted into the code as shown in
Example 3-8.


3.1.4.3. Data structure for computing X at compile time


This section presents a method for computing X at compile time. Since program localities
exhibit a hierarchical structure, a linked list can be very useful for representing localities at
various levels of the hierarchy. When a loop is encountered, a new element is added at the head of
the list. All data structures referenced inside a loop are considered as part of the record of the
most recently created element. When a loop exits, its entry element in the list is deleted and the
contribution of data structures to the locality comprised by the exiting loop is evaluated. Also, the
contribution of these data structures to higher level localities, represented by all the remaining
elements in the list, is evaluated. The outermost loop is always represented by the element at the
tail of the list. When this loop exits, the list becomes empty until another loop construct is
encountered. Just prior to the deletion of an element from the list, it should contain the virtual
size of the locality comprised by the exiting loop, i.e., the memory request primitive X associated
with the current locality.


The use of a linked list data structure (LLDS) facilitates a top down parsing strategy with
back tracking. Back tracking is necessary to compute the contribution of data structures referenced
at level L_i to all previously parsed higher level loops L_1, L_2, ..., L_{i-1}.

Figure 3-10: Linked list data structure for evaluating primitive X

Figure 3-10 shows the dynamic construction of the LLDS for evaluating the memory requirements
of the loops shown in the figure. A current pointer (CP) always points at the head of the list.
Eight parsing stages are shown in the figure. Each stage represents either the beginning or the end
of a loop. At stage A, the control statement of the first loop (L1) is encountered. A new element
(X1) is created at the head of the list. The current pointer points at X1, which will eventually
contain the value of the memory requested by ALLOCATE at level L1, i.e., the virtual size of the
locality comprised by L1. The second loop, L2, is parsed at stage B and a new entry X2 is added at
the head of the list. Now CP points at X2, and will continue to do so until L3 is encountered and X3
is created and added at the head of the list. Loop L3 exits at stage D. At this stage the locality
size comprised by L3 is completely computable, since all data structures contributing to X3 have been
parsed. Also, the contribution of these data structures to X1 and X2 can be evaluated at this point.
The record for X3, which includes the data structures referenced inside L3, is deleted from the LLDS.
Note that an exiting loop does not enclose any more loops; therefore, its memory requirement is
fully determined when it exits. At stage E, X2 is computed. The contribution of data structures
referenced inside L3 to X2 has already been determined when L3 exited. Therefore, L2 is treated as
if it were an innermost loop, although L2 encloses L3 as indicated by the loop structure. The effect
of this technique is similar to unrolling L3 and linearizing the nested structure at the L2 level. At
stage E, the contribution of data structures referenced inside L2 to X1 is evaluated and X2 is
deleted from the list. CP now points at X1. At this stage X1 contains the memory requirements due to
L2 and L3. Loop L4 is the only remaining enclosed loop that affects the locality at level L1. Loop L4
is encountered at stage F, where a new element X4 is added at the head of the list. At stage G, X4 is
computed and the contribution of L4 to X1 is found. Finally, the memory requirement of the
entire loop construct is evaluated at stage H, when L1 exits.


The list data structure described above allows a single top-down parsing scheme. However, a
back tracking mechanism is necessary to add the contribution of lower level localities to higher
level ones because of the hierarchical nature of localities. Back tracking achieves the same effect
as unrolling enclosed loops and linearizing the nested loop structure. Moreover, the LLDS technique
reduces the job of back tracking to a simple scan of the list.
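The list manipulation described above can be sketched with a Python list used as a stack, whose last element plays the role of the head of the LLDS. This is a simplified illustration under stated assumptions, not the compiler mechanism itself: the event tuples, the function name, and the idea of passing each array's lowest-level requirement and virtual size as precomputed numbers are all hypothetical, and the one-copy rule enforced by the boolean variable B is not modelled.

```python
def compute_requests(events):
    """Scan loop events once and return the memory request X of every loop.

    events contains ("enter", loop), ("ref", array, low_pages, virtual_pages)
    and ("exit", loop) tuples in source order.
    """
    stack = []       # the last element is the head of the list (CP)
    results = {}
    for event in events:
        kind = event[0]
        if kind == "enter":
            stack.append({"loop": event[1], "x": 0, "refs": []})
        elif kind == "ref":
            _, _name, low_pages, virtual_pages = event
            stack[-1]["refs"].append((low_pages, virtual_pages))
        elif kind == "exit":
            record = stack.pop()                 # delete X_i from the list
            low = sum(l for l, _ in record["refs"])
            full = sum(v for _, v in record["refs"])
            record["x"] += low
            if stack:
                stack[-1]["x"] += low            # same contribution to X_{i-1}
                for outer in stack[:-1]:
                    outer["x"] += full           # entire virtual size above that
            results[record["loop"]] = record["x"]
    return results

# Example 3-7: A1 needs 2 pages at the lowest level (100 in total), A2 needs 7 (400).
requests = compute_requests([
    ("enter", "Loop 10"), ("enter", "Loop 100"), ("enter", "Loop 1000"),
    ("ref", "A1", 2, 100), ("ref", "A2", 7, 400),
    ("exit", "Loop 1000"), ("exit", "Loop 100"), ("exit", "Loop 10"),
])
```

Back tracking here is the short scan over `stack` performed when a loop exits; for Example 3-7 the sketch yields 9, 9 and 500 pages for Loops 1000, 100 and 10 respectively.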


Each element of the list is a list structure by itself. A graphic illustration of one element of
the list, X_i, is shown in Figure 3-11. A record X_i has two major fields, one for vectors and one
for arrays. The array field has two fields, one for column wise referenced arrays, A^c, and the other
for row wise referenced arrays, A^r. The vector field has several entries, one for each vector
referenced at the current level L_i, represented by X_i. Each vector is described by two attributes:
the vector variable identifier V_i and its virtual size S_V. S_V is used for evaluating the
contribution of V to higher level localities represented by X_1, X_2, ..., X_{i-1}. Furthermore, each
vector is characterized by a list of distinct indexes used to reference elements of V_i at the L_i
level. The number of entries in the index list determines the maximum number of pages required to
allocate V at the current level.

Figure 3-11: Data structure for evaluating X

The B field in the V_i record serves as a boolean variable. The value of B is set to 1 if V_i is not
referenced at any lower level (L_{i+1}, L_{i+2}, ...). The need for such a boolean variable will
shortly be explained.

The column wise referenced array field has several entries, one for each array A_i^c. Each A_i^c is
described by the array identifier A_i, its virtual size S_A, a boolean variable B similar to the one
used for vectors, and a list of the columns referenced at the given level. Each column in the column
list is characterized, in its turn, by its virtual size and a list of row indexes used for
designating particular array elements. The contribution of any array referenced at L_i to X_i is
computed as follows. For each column J_j we find the number of entries N_j in the list of row
indexes, which is then compared with the value of the column virtual size S_c stored in the field of
the column index record. The lesser of N_j and S_c defines the memory requested to allocate the given
column. The contribution of A_i to X_i is found by summing up the values obtained for each
column J_j. The contribution of A_i to X_i is also attributed to X_{i-1}. The array A_i contributes
to all higher level localities, represented by X_1, X_2, ..., X_{i-2}, with the value stored in the
virtual size field of the A_i record.

A row wise referenced array is described by an identifier A_i, the virtual size of A_i, the
boolean variable B, and a list of row indexes used at the current level L_i. The value in the virtual
size entry is attributed to the memory requirements X_1, X_2, ..., X_{i-2}. The number of entries in
the row index sublist multiplied by N (the range of the column index, or the second dimension of the
array) defines the contribution of A_i to X_i and to the next higher level, X_{i-1}.

At any level of the main LLDS list, there should be only one copy of any array (a vector is a
one-dimensional array). This restriction avoids allocating memory to the same array more than
once. Assume that an array A is referenced at two levels L_i and L_j, where j < i; i.e., L_j is higher
than L_i. Data structures created for A at the X_i level contribute to both X_i and X_j. Data
structures constructed at the X_j level contribute only to X_j. If data structures for A were kept at
both levels, then A would be allocated more memory than it actually requires. Obviously, if the copy
associated with X_j is considered and that associated with X_i is ignored, then the memory request
X_i, at the L_i level, will be underestimated. Hence, data structures created for A at the X_i level
should be used for computing X_i and X_j. Data structures at X_j are ignored.

The boolean variable B associated with every array referenced at any level is used to enforce
the rule of one copy per array. When a data structure is created for an array A at level X_i,
the boolean variable B is set to 1 (B = 1). The value of B associated with A at all higher levels
(X_1, ..., X_{i-1}) is reset to 0 (B = 0). The contribution of any array with B = 0 is ignored, since
the contribution of this array has been accounted for at a lower level. Next, a procedure is
presented for computing X_i.

Procedure (3-1) Compute X_i;
BEGIN
  Initialize LLDS: LLDS := NIL;
  Case of encountering a loop L_i DO
  BEGIN
    Create X_i; {X_i has two fields}
      Vector: List;
      Array: (column wise, row wise);
        Column Wise Arrays: List;
        Row Wise Arrays: List;
    Initialize the list of vectors (VL): VL := NIL;
    Initialize the list of column wise referenced arrays: CAL := NIL;
    Initialize the list of row wise referenced arrays: RAL := NIL;
    CP := Pointer to X_i;
  END;
  Case of parsing a vector V_i(I_i) DO
  BEGIN
    IF V_i ∈ X_i
      THEN Update V_i
      ELSE Create V_i;
  END;
  Case of parsing an array A_i(I_i, J_j) DO
  BEGIN
    IF A_i is A^c
      THEN IF A_i ∈ X_i
             THEN Update A_i^c
             ELSE Create A_i^c
      ELSE IF A_i ∈ X_i {A_i is A^r}
             THEN Update A_i^r
             ELSE Create A_i^r;
  END;
  Case of exiting a loop L_i DO
  BEGIN
    Compute X_i;
    Compute the contribution of data structures at X_i level to X_1, X_2, ..., X_{i-1} levels;
    Reset B = 0 for each V_i and A_i encountered at level X_i and any other higher level;
    Delete X_i from LLDS;
  END;
END. {of Procedure 3-1}

Procedure (3-2) Create V_i;
BEGIN
  Create a new element (V_i) at the head of the vector list (VL);
  Compute S_V;
  Create an index list (IL) for V_i;
  Enter I into IL.
END; {of Procedure Create V_i}

Procedure (3-3) Update V_i;
BEGIN
  IF I is not a member of IL(V_i)
    THEN Add I to the index list IL of V_i;
END; {of Procedure Update V_i}

Procedure (3-4) Create A_i^c;
BEGIN
  Create a new element A_i at the head of the list of column wise referenced arrays (CAL);
  Compute the virtual size of A_i, S_A; {store S_A in the A_i record}
  Create a column index list (CIL);
  Compute the virtual size of a column, S_c; {store it in the CIL record}
  Enter the column index J_j into CIL;
  Create a row index list (RIL) for J_j;
  Add the row index I_i to RIL.
END; {of Procedure Create A_i^c}

Procedure (3-5) Update A_i^c;
BEGIN
  IF J_j ∈ CIL
    THEN IF I_i ∈ RIL(J_j)
           THEN Skip
           ELSE Add element I_i to RIL(J_j)
    ELSE
      BEGIN
        Add J_j to the list of column indexes CIL;
        Create a row index list RIL for J_j;
        Add I_i to RIL(J_j);
      END; {of ELSE statement}
END; {of Procedure Update A_i^c}

Procedure (3-6) Create A_i^r;
BEGIN
  Create an A_i element at the head of the list of row wise referenced arrays RAL;
  Compute the virtual size of A_i; {store S_A in the A_i record}
  Create a row index list RIL for A_i; {RIL := NIL}
  Add the row index I_i at the head of RIL;
END; {of Procedure Create A_i^r}

Procedure (3-7) Update A_i^r;
BEGIN
  IF I_i ∈ RIL(A_i)
    THEN Skip
    ELSE Add I_i at the head of RIL(A_i);
END; {of Procedure Update A_i^r}
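The Create and Update procedures for column wise referenced arrays amount to maintaining, per array, a list of column indexes, each with its own list of distinct row indexes. The Python sketch below merges Procedures (3-4) and (3-5) into one hypothetical function; the dictionary layout, the field names, and the use of index expressions as strings are assumptions made for illustration.

```python
def record_column_wise_reference(cal, name, row_index, column_index,
                                 array_pages, column_pages):
    """Record one reference A(row_index, column_index) in the list CAL.

    cal maps an array identifier to its record; the record is created on the
    first reference (Procedure 3-4) and updated afterwards (Procedure 3-5).
    """
    entry = cal.get(name)
    if entry is None:
        # Create: store S_A, S_c, the flag B, and an empty column index list.
        entry = {"S_A": array_pages, "S_c": column_pages, "B": 1, "columns": {}}
        cal[name] = entry
    # Update: add the column index if it is new, then the row index if it is new.
    ril = entry["columns"].setdefault(column_index, [])
    if row_index not in ril:
        ril.append(row_index)
    return entry

cal = {}
for row, col in [("I", "J"), ("I*2", "J"), ("I", "J+5"),
                 ("I+2", "J+5"), ("M-I", "J+5"), ("I", "J")]:
    record_column_wise_reference(cal, "A2", row, col, 400, 4)
```

After parsing the references to A2 from Example 3-7, the record holds two row indexes for column J and three for column J+5; the repeated reference A2(I,J) is skipped, as Procedure (3-5) requires.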

Consider the following notations and definitions, which are necessary to define a procedure for
evaluating X_i when the corresponding loop L_i exits. The length of a list L is the number of
elements in the list. Each vector is associated with a list of indexes; the length of this list is
denoted by L(V_n), n = 1, ..., N, where N is the number of vectors, or N = L(VL). Each column index
of A^c has a list of row indexes; the length of this list is denoted by L(J_k), k = 1, ..., K, where
K is the length of the list of column indexes, K = L(CIL). The number of A^c arrays is given by M,
the length of the A^c list, M = L(CAL). Each row wise referenced array A^r has a list of row indexes;
the length of this list is denoted by L(A_s), s = 1, ..., S, where S is the length of the list of
A^r arrays, S = L(RAL). The range of the column index of a row wise referenced array is denoted by
R_c(A_s). Using these notations, the following function can be used to compute X_i:

    X_i = X_i + \sum_{n=1}^{N} \min\bigl( L(V_n), S_{V_n} \bigr)
              + \sum_{m=1}^{M} \min\Bigl( \sum_{k=1}^{K} \min\bigl( L(J_k), S_{c_m} \bigr), S_{A_m} \Bigr)
              + \sum_{s=1}^{S} \min\bigl( L(A_s) \times R_c(A_s), S_{A_s} \bigr)        (3-13)


The terms in Equation (3-13) represent the contributions of vectors, column wise referenced arrays,
and row wise referenced arrays, respectively. These contributions are added to what has already been
stored in the $X_i$ field due to contributions from lower levels.
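
As a worked illustration of Equation (3-13), the following Python sketch evaluates the three contributions for a small, made-up set of list lengths and virtual sizes. The function name, the tuple layouts, and the numeric values are assumptions chosen for the example only.

```python
def locality_size(x_prev, vectors, col_arrays, row_arrays):
    """Evaluate X_i following the three sums of Equation (3-13).

    x_prev     : value already stored in X_i from lower levels
    vectors    : list of (L_V, S_V) pairs, one per vector
    col_arrays : list of (row_list_lengths, S_C, S_A), one per column wise array,
                 where row_list_lengths holds the row list length of each column index
    row_arrays : list of (L_A, R_c, S_A), one per row wise array
    """
    x = x_prev
    # Vectors: at most min(list length, virtual size) pages each.
    x += sum(min(l_v, s_v) for l_v, s_v in vectors)
    # Column wise arrays: per-column contributions, capped by the array size.
    x += sum(
        min(sum(min(l_j, s_c) for l_j in lengths), s_a)
        for lengths, s_c, s_a in col_arrays
    )
    # Row wise arrays: list length times column range, capped by the array size.
    x += sum(min(l_a * r_c, s_a) for l_a, r_c, s_a in row_arrays)
    return x


# Hypothetical inputs: one vector, one column wise array with two column
# indexes, and one row wise array.
x_i = locality_size(
    x_prev=4,
    vectors=[(2, 5)],                # min(2, 5) = 2
    col_arrays=[([1, 3], 2, 10)],    # min(min(1,2) + min(3,2), 10) = 3
    row_arrays=[(2, 4, 6)],          # min(2 * 4, 6) = 6
)
```

With these inputs the three terms contribute 2, 3, and 6 pages, so the stored value 4 grows to 15.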

The contribution of vectors and arrays to higher levels is given by the following two formulas:

\[
X_i = X_i + \sum_{n=1}^{N} S_{V_n} + \sum_{m=1}^{M} S_{A_m^c} + \sum_{s=1}^{S} S_{A_s^r}
\tag{3-14}
\]

\[
X_{i-1} = X_{i-1} + \sum_{n=1}^{N} S_{V_n} + Q
\tag{3-15}
\]

where $Q$ is the sum of the last two terms in Equation (3-13).


3.1.5.  Automatic  insertion  of  ALLOCATE  at  compile  time 

ALLOCATE is inserted just before the beginning of each loop comprising a locality. The two
primitives of ALLOCATE, P and X, are computed and assigned to each loop according to
Algorithm 3-1 for P and Procedure 3-1 for X. It would have been very simple to insert
ALLOCATE at the beginning of each loop, once P and X are evaluated, if ALLOCATE exhibited a
linear structure. Because of the hierarchical structure of ALLOCATE, the primitives of higher level
localities are carried into all subsequent lower level localities. Therefore, the mechanism used
for inserting a directive at a particular level should be able to memorize the primitives associated
with all levels enclosing the current level. The memory capacity should be at least equal to the nest
depth of the currently parsed loop. A suitable data structure for implementing such a mechanism is
a stack or a linked list.

ALLOCATE directives are inserted using a stack data structure as follows. When a loop $L_i$ is
encountered, its primitives $(P_i, X_i)$ are pushed onto the top of the stack. When a loop $L_i$ exits, the
$(P_i, X_i)$ pair at the top of the stack is deleted. At any parsing level, a directive's parameters consist
of all the elements in the stack ordered from bottom to top and separated by the word "else." The
directive inserted at the beginning of $L_i$ has the form

ALLOCATE (P_1,X_1) else (P_2,X_2) else ... else (P_i,X_i)

A linked list implementation is similar, since a stack can be realized as a linked list. Besides
simulating the hierarchical nature of ALLOCATE, the stack implementation facilitates a single top
down parsing scheme without backtracking. Algorithm 3-2 automatically inserts ALLOCATE at compile
time into a program's compiled code.


Algorithm 3-2: Insert ALLOCATE at Compile Time;

Initialize the directive's stack DS;

Parse {until the end of the program}

Case of encountering a loop L_i with primitives (P_i,X_i) DO
BEGIN

PUSH (P_i,X_i) at the top of DS;

FORM the directive

ALLOCATE (P_1,X_1) else ... else (P_i,X_i)

{starting from the bottom of DS until the top of DS}

INSERT the directive right before the beginning of L_i;

END; {of case statement}

Case of exiting a loop L_i DO

DELETE the pair (P_i,X_i) from the top of the stack;

END. {of Algorithm 3-2}

An example using Algorithm 3-2 is shown in Figure 3-12. A loop construct with a maximum nest
depth of 3 is used in Figure 3-12 to illustrate the operation of Algorithm 3-2. The primitives $(P_i, X_i)$
are assumed to be known for each of the four loops. The directives are inserted as shown at the
beginning of each loop. The stack is updated upon encountering a loop begin or loop end control
statement. When Loop1 is encountered, the $(P_1, X_1)$ pair is pushed at the top of the stack. The
directive at the beginning of Loop1 has the form ALLOCATE (P_1,X_1). Next, Loop2 is encountered
and the $(P_2, X_2)$ pair is pushed at the top of the stack. The directive at this point has the form

ALLOCATE (P_1,X_1) else (P_2,X_2).

At stage 3, Loop3 is encountered and the pair $(P_3, X_3)$ is pushed to the top of the stack. The
directive inserted at the beginning of Loop3 has the form

ALLOCATE (P_1,X_1) else (P_2,X_2) else (P_3,X_3).

Upon exiting Loop3, the pair $(P_3, X_3)$ is removed, and upon exiting Loop2, the pair $(P_2, X_2)$ is
removed. Loop4 is then encountered and the pair $(P_4, X_4)$ is pushed at
the top of the stack. The directive inserted at the beginning of Loop4 has the form

ALLOCATE (P_1,X_1) else (P_4,X_4).

Note that Loop 4 is enclosed by Loop 1. The pairs $(P_4, X_4)$ and $(P_1, X_1)$ are deleted upon exiting
Loop 4 and Loop 1, respectively; the stack remains empty until another loop structure is encountered.
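
The stack discipline of Algorithm 3-2 can be sketched in Python as follows. This is an illustrative sketch, not the compiler pass itself: the event list that stands in for the parser, the function name, and the symbolic primitive names are assumptions. The events mirror the four-loop structure described above, in which Loop2 and Loop4 are nested in Loop1 and Loop3 is nested in Loop2.

```python
def insert_allocate(events):
    """Return (loop_name, directive_text) pairs, one per loop entry.

    events is a hypothetical parser trace: ("enter", name, (P, X)) when a
    loop begins and ("exit", name) when it ends.
    """
    stack = []        # directive stack DS, one (P, X) pair per enclosing loop
    directives = []
    for event in events:
        if event[0] == "enter":
            _, name, pair = event
            stack.append(pair)  # PUSH (P_i, X_i) at the top of DS
            # FORM the directive from the bottom of DS to the top.
            args = " else ".join(f"({p},{x})" for p, x in stack)
            directives.append((name, f"ALLOCATE {args}"))
        else:
            stack.pop()         # DELETE the pair at the top of DS
    return directives


events = [
    ("enter", "Loop1", ("P1", "X1")),
    ("enter", "Loop2", ("P2", "X2")),
    ("enter", "Loop3", ("P3", "X3")),
    ("exit", "Loop3"),
    ("exit", "Loop2"),
    ("enter", "Loop4", ("P4", "X4")),
    ("exit", "Loop4"),
    ("exit", "Loop1"),
]
directives = dict(insert_allocate(events))
```

Because each directive is formed from the stack contents at the moment the loop is entered, a single top down pass suffices and no earlier directive has to be revisited.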

3.2.  LOCK  and  UNLOCK  Directives 

LOCK is used to prevent particular pages from being paged out of memory by the replacement
policy. UNLOCK is used to release these pages. LOCK and UNLOCK have been used as system
facilities by the VAX/VMS and UNIX operating systems. Abaza [1] measured the effectiveness of
using LOCK and UNLOCK under VMS. His results show that the behavior of some numerical
algorithms can be drastically improved if LOCK and UNLOCK are properly used under VMS.
However, in these systems the problem of locking and unlocking particular pages is still a user
problem rather than a system problem. The user is expected to have adequate knowledge of the
behavior of the program and, in particular, to be able to identify the pages that are needed most in
memory so that they can be locked.

In this study, pages to be locked in memory are identified automatically at compile time. As
in the case of an ALLOCATE directive, the cases of vectors and arrays are considered separately. In
general, a page may be a candidate for locking if it is referenced in an intra-locality transition period.
Intra-locality transition periods occur within a hierarchical locality structure, whereas
inter-locality transitions occur between two successive hierarchical locality structures. Using source
level code notation, intra-locality transitions are caused by references to array data structures
between two successive loop start control statements. Let $L_i$ refer to the beginning of a loop in a
multi-nested loop structure and $L_{i+1}$ refer to the beginning of the next loop. Intra-locality
transition pages are those pages referenced between $L_i$ and $L_{i+1}$.

A page referenced in an intra-locality transition period does not contribute to localities
formed at the next lower levels, $L_{i+1}, L_{i+2}, \ldots$ . Intra-locality transition pages, on the other hand,
are included in all higher level localities, $L_1, \ldots, L_i$. Further illustration is presented in Example
3-9.

Example 3-9

      DO 10 i=1,N
         V1(i)
         DO 100 j=1,M
            V1(j)
  100    CONTINUE
   10 CONTINUE

In Example 3-9, a page of vector V1, designated by the virtual address of V1(i), $P_{V1(i)}$, is
referenced in the transition period between Loop 10 and Loop 100. This page remains idle as long as
Loop 100 is in execution. However, it is reactivated when Loop 100, after M iterations, returns
control to Loop 10. Therefore, locking $P_{V1(i)}$ in memory avoids the need to page it into main memory
every time control returns to Loop 10. Note that if the request generated by the ALLOCATE directive
associated with Loop 10, ALLOCATE (2, S_V1), is granted, locking a page from the virtual space of V1
has no significance.

Thus far, a LOCK directive has the following form with one primitive:

LOCK (Y_1, Y_2, ..., Y_n)

where $Y_i$ is a particular virtual page. Once LOCK is executed by the CPU, a request is made to the
operating system to lock into memory those pages identified by the virtual addresses
$Y_1, Y_2, \ldots, Y_n$. Pages are unlocked, or released, by an UNLOCK directive, which has the following
form:

UNLOCK (Y_1, Y_2, ..., Y_n).

LOCK is inserted inside the loop $L_i$ and UNLOCK is inserted at the end of $L_i$; see Example 3-10.
Example 3-10:

      DO 10 i=1,N
         V(i)
         LOCK P_V(i)
         DO 100 j=1,M
            V(j)
  100    CONTINUE
   10 CONTINUE
      UNLOCK P_V(i)

In a multi-nested loop structure, pages could be locked at various levels of a locality
structure. Therefore, it is possible that a program would be running with its lowest level locality
(P=1) while some pages belonging to higher level localities are being locked in main memory as a
result of executing a LOCK directive. In case of high memory contention, a program should be
allowed to run only with its lowest level locality. Partial swapping, introduced previously for
ALLOCATE, guarantees that higher level localities are not allocated when a program's request
with P=1 cannot be granted. In a similar fashion, the operating system should be allowed to
unlock a previously locked page, even before it is released by an UNLOCK directive. Since pages
can be locked at various levels of a hierarchical locality structure, a priority index P can be used
to define the priority of releasing a page by the operating system before it is released by
UNLOCK. For this purpose a priority index primitive is introduced into the LOCK directive:

LOCK (P, Y_1, Y_2, ..., Y_n).

Pages locked at the lowest level of a locality structure should be released last, since they are
invoked more frequently than those referenced and locked at higher levels. Therefore, pages
locked at lower levels of the hierarchical locality structure should have a higher priority than
those at higher levels. To be consistent with the priority index used for ALLOCATE, smaller P
values are used to denote a higher priority. In other words, pages locked with larger P values are
released before pages locked with smaller P values. Priorities are assigned to loops in the same
way as for an ALLOCATE directive; see Algorithm 3-1. A directive may in principle have a
priority P=1, associated with the innermost loop. However, in practice such a directive is never
used because the memory requirement of the innermost loop, defined by the ALLOCATE
directive, is always granted. For further illustration, consider the example in Figure 3-13.

The maximum nest depth of the loop construct in Figure 3-13 is 3. The priorities assigned to
L1 and L2 are P=3 and P=2, respectively. The value P=1 is the priority of L3. Inside L3 no pages
should be locked, since the locality comprised by L3 is allocated by the ALLOCATE directive
with P=1. Assume that the ranges of loops L1, L2, and L3 are K, N, and M, respectively. Each $Y_i$
page, referenced at level L2, is referenced at least N times more than any $X_i$ page, referenced at
level L1. For this reason, a page locked at a higher level, L1 in this case, should be released before
a page locked at a lower level, L2 in our example.

L1
    LOCK (3, X_1, ..., X_n)
    L2
        LOCK (2, Y_1, ..., Y_n)
        L3
    UNLOCK (Y_1, ..., Y_n)
UNLOCK (X_1, ..., X_n)

Figure 3-13: Example of LOCK and UNLOCK directives

Inserting  LOCK  at  compile  time  is  very  simple  since  LOCK  does  not  exhibit  a  hierarchical 
structure  as  ALLOCATE  does.  Algorithm  3-3  is  used  for  automatic  insertion  of  LOCK  and 
UNLOCK  directives. 

Algorithm 3-3: Insert LOCK and UNLOCK directives;

CASE of encountering a loop DO

P = P assigned to current loop;

Y_i = page to be locked (i = 1, 2, ..., n);

IF P ≠ 1 THEN INSERT

1- LOCK after the loop BEGIN control statement;

2- UNLOCK after the loop END control statement;
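
A minimal Python sketch of Algorithm 3-3 is given below. The nested dictionary that stands in for the parsed loop structure, the BEGIN/END marker lines, and the page names are assumptions made for illustration; only the placement rule (LOCK after the loop begin, UNLOCK after the loop end, nothing for P=1) is taken from the algorithm.

```python
def emit(loop, depth=0, out=None):
    """Emit a loop skeleton with LOCK/UNLOCK placed as in Algorithm 3-3.

    loop is a hypothetical parse-tree node:
    {"name": str, "P": int, "pages": [str, ...], "inner": [loop, ...]}
    """
    out = [] if out is None else out
    pad = "  " * depth
    args = ", ".join(loop["pages"])
    # No directive is inserted for the innermost locality (P = 1).
    lock = loop["P"] != 1 and bool(loop["pages"])

    out.append(f"{pad}BEGIN {loop['name']}")
    if lock:
        out.append(f"{pad}  LOCK ({loop['P']}, {args})")    # after loop BEGIN
    for inner in loop.get("inner", []):
        emit(inner, depth + 1, out)
    out.append(f"{pad}END {loop['name']}")
    if lock:
        out.append(f"{pad}UNLOCK ({args})")                 # after loop END
    return out


nest = {
    "name": "L1", "P": 3, "pages": ["X1", "X2"],
    "inner": [{
        "name": "L2", "P": 2, "pages": ["Y1"],
        "inner": [{"name": "L3", "P": 1, "pages": ["Z1"], "inner": []}],
    }],
}
lines = emit(nest)
```

For this three-level nest the output has the same shape as Figure 3-13: L1 and L2 receive LOCK/UNLOCK pairs with P=3 and P=2, and L3 receives none.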


Pages to be locked by LOCK are either vector or array pages, as discussed earlier for
ALLOCATE. For the case of a vector V, any page referenced at some level $L_i$ is likely to be rereferenced
after the execution of the $L_{i+1}$ loop. A reference to a vector element V(j) is translated to a
reference to a virtual page $P_{V(j)}$. If more than one vector element is referenced at a particular level,
using more than one index, then it is possible that more than one page needs to be locked, depending
on the values of the index variables. Therefore, a page to be locked is identified by the referencing
index, j. For example, if a vector is referenced as V(j), V(j1), V(j2), then the page(s) containing
these three elements are locked. At compile time, a candidate page for the LOCK directive is
identified by the vector name and the vector's index variable; no address translation is
assumed at compile time. At run time, a reference to a vector element V(j) is translated into the
virtual address of the page $P_{V(j)}$ storing the element V(j).
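
The run-time translation from a vector reference to a page can be sketched as follows. The sketch assumes contiguous storage, FORTRAN-style indexing from 1, and hypothetical values for the base address, element size, and page size; none of these parameters is specified in the text.

```python
def page_of(base, index, elem_size, page_size):
    """Virtual page number holding element V(index), with 1-based indexing."""
    return (base + (index - 1) * elem_size) // page_size


def pages_to_lock(base, indexes, elem_size, page_size):
    """Distinct pages touched by the referencing indexes, in ascending order."""
    return sorted({page_of(base, j, elem_size, page_size) for j in indexes})


# Hypothetical layout: 8-byte elements and 512-byte pages (64 elements per page).
same_page = pages_to_lock(base=0, indexes=[1, 2, 64], elem_size=8, page_size=512)
two_pages = pages_to_lock(base=0, indexes=[1, 64, 65], elem_size=8, page_size=512)
```

Three references such as V(j), V(j1), V(j2) therefore yield one page when the index values fall in the same page and several pages otherwise, which is why the directive is parameterized by index variables rather than by fixed addresses.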

The fact that the OS can release a locked page before UNLOCK does so gives LOCK a soft
property. LOCK's soft property can be incorporated into the partial swapping mechanism. This feature
of the swapping mechanism further supports the redistribution of memory space among
processes in cases of high memory contention.

For arrays referenced in a row major order, $A^r$, a page referenced at level $L_i$ is unlikely to be
rereferenced after the execution of $L_{i+1}$ unless the page size is larger than the column virtual size
of the array, in which case two successive row elements may be stored in the same page. Therefore, a
page of $A^r$ may be locked only if the column virtual size is less than the page size; a page to be
locked is of the form $P_{A(i,j)}$, where i is the row index and j is the column index.

Arrays referenced in a column major order, $A^c$, are similar to vectors. Each column, in fact,
resembles a vector. Therefore, for each column the distinct row indexes determine the virtual pages
that may be referenced at a given level. A page to be locked is identified at compile time as $P_{A(i,j)}$,
where i is the row index and j is the column index.

The implementation of LOCK is fairly simple. A lock bit (LB) is associated with each page.
When a request is generated to lock a page $Y_i$ into memory, the lock bit of $Y_i$ is set to one,
LB($Y_i$)=1. The replacement policy avoids replacing any page with LB=1. The partial swapping
mechanism, when initiated by a running process, searches for pages with LB=1 in order to unlock them.

A list data structure (LLSD) similar to the one designed for ALLOCATE can be used to
identify those pages which should be locked at each level. LLSD is initially empty. Once a new loop
$L_i$ is parsed, a new entry $X_i$ is created and inserted at the head of the list. Upon exiting a loop $L_i$,
the list of virtual pages found at this level is assigned to the LOCK directive and the element $X_i$ is
deleted. If the exiting loop has P=1, no LOCK directive is inserted. The data structures created for
each element are similar to those described for ALLOCATE. The main difference is that for
ALLOCATE the number of distinct pages that could be referenced at a particular level is of primary
concern, whereas for LOCK the particular pages referenced at a given level are of primary concern.
Moreover, the data structures associated with an element do not contribute to other elements in the
list; therefore, no backtracking is necessary. An example is given in Figure 3-14, where the
primitives of LOCK are further explained.

      ...
   30 CONTINUE
   20 CONTINUE
      A(i,1)
      DO 40 j = 1,M
         A(j,i)
   40 CONTINUE
   10 CONTINUE

      DO 10 i = 1,N
      LOCK (3, V(i), A(i,1))
      ...
      LOCK (2, V(j))
      DO 30 k = 1,R
      ...
   30 CONTINUE
      ...
      DO 40 j = 1,M
      ...
   40 CONTINUE
   10 CONTINUE
      UNLOCK (V(i), A(i,1))

Figure 3-14: Example using LOCK and UNLOCK directives
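
The lock bit mechanism and the soft property of LOCK described in this section can be sketched in Python as follows. The class and method names are hypothetical, a first-in first-out scan stands in for whatever replacement policy the system uses, and the lock bit is folded into a per-page priority field (None meaning LB=0); these are assumptions of the sketch rather than details given in the text.

```python
from collections import OrderedDict


class Frames:
    """Resident pages with a lock bit and a LOCK priority per page."""

    def __init__(self):
        # page -> priority P of the LOCK that pinned it; None means LB = 0.
        self.resident = OrderedDict()

    def load(self, page):
        self.resident.setdefault(page, None)

    def lock(self, priority, pages):
        for page in pages:
            self.load(page)
            self.resident[page] = priority   # LB(page) = 1

    def unlock(self, pages):
        for page in pages:
            if page in self.resident:
                self.resident[page] = None   # LB(page) = 0

    def victim(self):
        """Oldest resident page with LB = 0, or None if every page is locked."""
        for page, priority in self.resident.items():
            if priority is None:
                return page
        return None

    def soft_unlock(self):
        """Soft property: clear the lock bit of a page with the largest P."""
        locked = [(p, page) for page, p in self.resident.items() if p is not None]
        if not locked:
            return None
        _, page = max(locked, key=lambda entry: entry[0])
        self.resident[page] = None
        return page


frames = Frames()
frames.load("a")
frames.lock(3, ["x"])          # locked at a higher level (larger P)
frames.lock(2, ["y"])          # locked at a lower level (smaller P)
first_victim = frames.victim()
frames.lock(2, ["a"])          # now every resident page is locked
no_victim = frames.victim()
released = frames.soft_unlock()
```

In this trace the replacement scan first selects the unlocked page, finds no candidate once all pages are locked, and the soft unlock then releases the page locked with P=3 before any page locked with P=2.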




3.3.  Subprogram  Sequence  Control  Under  CD 

In this section our concern is with mechanisms for controlling memory allocation when a
subprogram is called. Programs are usually hierarchically structured into a main program and
subprograms. Each subprogram may call another subprogram, and so forth. The simplest control
structure of subprograms can be explained by the copy rule (CR): the effect of a subprogram CALL
statement is the same as would be obtained if the CALL statement were replaced by a copy of the body
of the subprogram before execution. Viewed in this way, a locality may be comprised partially by
the calling program and partially by the called subprogram. Memory directives are inserted into
the program code after subprogram CALL statements have been replaced by the subprogram
body. During execution, a call to a subprogram has no effect on the current memory allocation
unless the called subprogram generates a new memory directive with a new memory allocation
request.

The copy rule can be applied explicitly, with the body of a called subprogram copied in-line,
only if the subprogram is very short. Otherwise, a subprogram call is eliminated in principle,
not in practice. Identifying program localities under the copy rule is a complex problem, since
subprograms can no longer be considered separate entities which comprise separate locality structures.
Moreover, the depth of a locality hierarchy is increased by as much as the depth of the subprogram
hierarchical structure. Another major drawback of the CR technique is that subprograms cannot be
recursive. However, recursion is a common characteristic of many algorithms, which naturally
leads to recursive subprogram structures. Although the program model in this thesis is FORTRAN,
which does not support recursion, it is desirable to extend the application of CD to other
languages supporting direct or indirect recursion.

In order to simplify the process of directive insertion, a subprogram CALL statement should
be treated as a regular statement without affecting the current locality structure. Moreover, a
subprogram, when compiled, should be considered as a separate entity consisting of its own locality
structures. Finally, it is desirable to allow recursive subprogram calls rather than just simple
CALL-RETURN subprogram statements. These goals can be achieved using the activation record
technique.

Activation  record  technique 

Under  the  activation  record  technique,  memory  directives  are  inserted  in  the  usual  manner  at 
compile  time.  Subprogram  CALL  statements  are  treated  as  regular  code  statements  having  no  effect 
on  the  current  locality  structure.  Locality  structures  comprised  by  the  called  subprogram  code  do 
not  contribute  to  the  locality  structures  of  the  calling  program.  In  effect,  each  subprogram  is  con¬ 
sidered  a  separate  locality  entity. 

At execution time, when a subprogram is activated by a CALL statement, the calling
subprogram or main program is temporarily halted. The memory allocation previously set by
directives generated at the calling program level may be altered by a directive generated during the
execution of the called subprogram. When the execution of a subprogram is completed,
execution of the calling program resumes at the point immediately following the call.
The memory allocation at this point should be similar to the memory allocation at the point of
executing the CALL statement. The activation record technique is used to keep a record of the memory
directive primitives of each subprogram for as long as it remains in execution. A subprogram
remains in execution until it returns control to the calling subprogram. The CPU is always
controlled by an active subprogram.

At the time of a subprogram call, a new activation record is created for the newly activated
subprogram; the record is destroyed upon the subprogram's return. A simple central stack may be used
to store the activation records of all subprograms in execution which have not yet returned. The
last item created on the stack must be the first item to be deleted. Similarly, the first item created
on the stack is the last one to be deleted. The implementation of the subprogram call and return
proceeds as follows. At the start of program execution, a large block of storage is reserved for the
central stack. The activation record for the main program is allocated at one end of the block. This
becomes the bottom of the stack.

When a subprogram A is called, storage for its activation record is allocated adjacent to
the main program's activation record. If A calls B, B's activation record is allocated adjacent to
A's. If B calls C, C's activation record is allocated adjacent to B's, and so on. When C terminates
and returns control to B, C's storage is deleted, and then B's when B returns, and so on. The central
stack implementation for a series of subprogram calls and returns is shown in Figure 3-15.

Each activation record contains several data objects. One data object is used to store the return
address of a subprogram, which can be thought of as a pointer to the previous activation
record in the central stack. Return address values form a linked list that links together the
activation records on the central stack in the order of their creation. The current environment pointer
(CP) is constantly updated to point at the "top" activation record in the stack. From the return
address value in its return point, the second activation record in the stack may be reached. From
the return address value of this record, the third activation record can be reached, and so on. At
the end of this chain, the last link leads to the activation record for the main program. This chain is
called a dynamic chain because it chains together subprogram activations in the order of their
dynamic creation.

Figure 3-15: Use of a central stack of activation records

Our main concern in this study is with the directive primitives (DP) data object. The DP
data object is used to store the values of the (P,X) pair used by the ALLOCATE
directive. The entries of an activation record are shown in Figure 3-16. The current memory
allocation of a program is determined by the values of the (P,X) pair of the "top" activation record
specified by CP. While the new subprogram is executing, the contents of P and X change
as new directives are executed. When a subprogram terminates, its activation record is
deleted together with its data objects, and CP then points at the second activation record in the stack.
The previously halted calling subprogram resumes execution by restoring the values of the (P,X) pair,
among other data objects, recorded at the time of executing the CALL statement. At every point, the
memory allocation of a program is determined by the values of the (P,X) pair obtained from the
activation record at the top of the stack, pointed at by CP.
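
The central stack of activation records can be sketched in Python as follows. The class names, the use of an object reference in place of the return address, and the example (P,X) values are assumptions made for illustration; the sketch only shows how the DP entry of the caller becomes current again when the callee returns.

```python
class ActivationRecord:
    def __init__(self, name, caller):
        self.name = name
        self.caller = caller   # stands in for the return-address link
        self.dp = None         # directive primitives: the latest (P, X) pair


class CentralStack:
    def __init__(self):
        # The main program's record is the bottom of the stack.
        self.cp = ActivationRecord("MAIN", None)

    def call(self, name):
        """CALL: create a record for the callee and make it the top."""
        self.cp = ActivationRecord(name, self.cp)

    def allocate(self, p, x):
        """Executing ALLOCATE updates the DP entry of the top record."""
        self.cp.dp = (p, x)

    def ret(self):
        """RETURN: drop the top record; the caller's (P, X) is restored."""
        self.cp = self.cp.caller
        return self.cp.dp

    def dynamic_chain(self):
        """Record names from the top of the stack down to MAIN."""
        names, record = [], self.cp
        while record is not None:
            names.append(record.name)
            record = record.caller
        return names


stack = CentralStack()
stack.allocate(2, 50)      # directive executed in MAIN
stack.call("A")
stack.allocate(1, 20)      # directive executed in A
stack.call("B")
stack.allocate(1, 5)       # directive executed in B
chain_in_b = stack.dynamic_chain()
restored_for_a = stack.ret()
restored_for_main = stack.ret()
```

Following the caller links from CP visits B, A, and MAIN in that order, which is the dynamic chain; each return hands back the pair that was current when the corresponding CALL was executed.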

When a subprogram A calls subprogram B (executes a CALL B statement), the directive
primitives entry of A's activation record contains the values $P_A$ and $X_A$. B executes for a while and
then terminates. A then resumes its execution, requesting the allocation of $X_A$ pages with a priority
$P_A$. The request is satisfied if $X_A$ pages can be allocated from the free page pool in main memory. It
is possible that $X_A$ pages cannot be allocated, although subprogram A was running with $X_A$ pages at
the time of its interruption. In such a case, the OS invokes the swapper if $P_A = 1$. Otherwise,
execution continues with the current allocation until the next directive is received with a new pair
(P,X).

Figure 3-16: Activation record entries

Rather than storing only one (P,X) pair in the activation record, it is more effective to store
the set of pairs $(P_1,X_1), (P_2,X_2), \ldots, (P_n,X_n)$ specified by the argument list of the ALLOCATE
associated with the lowest level locality. When subprogram A resumes its execution after the
termination of subprogram B, the activation record of A is searched for a pair $(P_i,X_i)$ that can be
allocated. The first pair to be tried for allocation is $(P_1,X_1)$, then $(P_2,X_2)$, and so on until the
pair with P=1 is reached. Note that $P_1 > P_2 > \cdots > P_n$ and $X_1 > X_2 > \cdots > X_n$. This scheme avoids the
need to wait for the arrival of a new directive when the current directive entry pair (P,X) cannot be
allocated. Moreover, it reduces the cost of processing a directive, as will be discussed in the next
section. An example is shown in Figure 3-17. A multi-nested loop structure with a maximum nest depth
of 3 is shown in the figure with ALLOCATE directives inserted at the appropriate levels (ALL stands
for ALLOCATE). Memory request primitives, X, are arbitrarily assigned. Note that the number of
(P,X) pairs in the directive entry is limited by the maximum depth of the loop structure comprising
the current locality. The activation record of A is dynamically updated. At stage 1, the pair
(P=3, X=100) is stored in the record. At stage 2, the second parameter of the directive is entered
into the activation record. At stage 3, the DP entry of the activation record is filled. At any time
during the execution of A, the memory space allocated to A is given by one of the (P,X) pairs in the
activation record.


Recursive calls to a subprogram are simply implemented by creating a new activation record
for the subprogram every time it is called. The size of the central stack may become too large due to
the increased number of activation records created for a recursively called subprogram.

Figure 3-17: Example of subprogram sequence control

3.4. Cost of CD

There  are  two  types  of  cost  associated  with  CD.  The  first  one  is  the  cost  of  inserting  directives 
at  compile  time.  The  second  one  is  the  cost  of  executing  a  directive. 

Compile  time  cost  is  less  severe  because  directives  are  inserted  only  once.  This  cost  can  be  lim¬ 
ited  by  having  the  directives  inserted  only  into  syntactically  error  free  programs.  This  restriction 
is  aimed  at  reducing  the  number  of  times  a  compiler  has  to  insert  the  directives  into  the  program 
code.  The  actual  cost  of  inserting  memory  directives  at  compile  time  is  not  measured  in  this  study. 
This  problem  is  left  for  further  research  and  studies. 

This section elaborates on the cost of CD at execution time. The cost of CD at execution time
is the cost of executing the directives ALLOCATE, LOCK, and UNLOCK. Our concern here is with
the overhead due to multiple executions of a directive located inside a loop. We will also discuss
the overhead due to the execution of the "else" conditional statements incorporated by ALLOCATE,
since a conditional statement is a time-consuming operation compared to regular expression
evaluation. These two factors contributing to the cost of CD at execution time are discussed next.

The structure of ALLOCATE can be relaxed to exclude the conditional statement "else," thus
giving ALLOCATE the simple form ALLOCATE (P,X), where P and X are the primitives
associated with a loop. However, it is necessary that ALLOCATE reflect the hierarchical nature of a
locality structure and respond to the constantly changing memory status of the system due to
multiprogramming interaction. One way of preserving the hierarchical structure of ALLOCATE is
to use a multiple DP entry for the activation record discussed in the previous section. When a
directive is executed, the values of its (P,X) pair are stored in the activation record in descending
order. When a second directive is executed, the values of its (P,X) pair are stored in the activation
record, and so on until the directive at the lowest locality level with P=1 is executed; at this point
the DP entries are filled.

When a program is allocated a time slice, the OS examines the activation record at the top of the
central stack. The pairs (P, X) are tried for allocation in descending order. If the OS fails to
allocate the first pair, it tries the second one, and so on, until the last one is reached. The
swapper is invoked upon failing to allocate the pair (P=1, X), as explained in Section 3.1.3. Note
that the conditional statement is now transferred to the OS level of execution, where the OS checks
the values of the activation record and compares them with the available free memory space.
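
The OS-level check described above can be sketched as follows. This is only an illustrative sketch, not the implementation used in this study: the names ActivationRecord, dp_entries, and try_allocate are assumptions introduced here, and the example pairs are borrowed from the directive ALLOCATE (2,60) else (1,2) discussed in Chapter 4.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ActivationRecord:
    # Hypothetical layout: the (P, X) pairs of the DP entries.
    # P = 1 denotes the minimal request of the lowest locality level.
    dp_entries: List[Tuple[int, int]] = field(default_factory=list)


def try_allocate(record: ActivationRecord, free_pages: int) -> Optional[Tuple[int, int]]:
    """Return the first (P, X) pair, in descending order of P, that fits.

    A result of None means that even the P = 1 request cannot be met,
    which is the point at which the text says the swapper is invoked.
    """
    for priority, pages in sorted(record.dp_entries, reverse=True):
        if pages <= free_pages:
            return (priority, pages)
    return None


record = ActivationRecord(dp_entries=[(2, 60), (1, 2)])
print(try_allocate(record, free_pages=100))  # (2, 60)
print(try_allocate(record, free_pages=50))   # (1, 2)
print(try_allocate(record, free_pages=1))    # None
```

The conditional test now lives in one place, the OS, rather than in every directive executed by the program.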

Multiple execution of a directive is caused by executing a directive located inside a loop.
Obviously, the directive is treated as a regular instruction which has to be executed, unless
otherwise stated. Such multiple execution of a directive adds to the cost of CD, especially when
the memory status has not changed since the last time the directive was executed, in which case
the execution of the directive is mere overhead. Using a multiple directive point entry in the
activation record and the relaxed form of ALLOCATE proves to be useful in reducing the number of
times a directive is executed. A directive inserted at a higher level need not be executed at
lower levels because its primitives have already been stored in the activation record. However, a
lower level directive, although relaxed, still has to be executed every time the loop containing
the directive iterates.

The optimal solution to this problem is to move all the directives outside the loop structure.
This can be done either at compile time, when the directives are inserted, or at run time, when
the directives are first executed. Eventually, all the directives of a loop structure will be
stored in the activation record. Therefore, if the removal of a directive is to take place at run
time, then once a directive is processed, its primitives are stored in the DP entry of the
activation record and the directive is removed from the program code.

The  cost  of  executing  the  directives  in  their  original  form,  without  relaxation  and  without 
using  activation  records  with  multiple  data  entry,  is  measured  in  this  study.  The  results  are 
reported  in  the  next  chapter. 

3.5.  Summary  and  Conclusions

We have presented in this chapter a compiler directed memory management policy (CD). Three
memory directives, ALLOCATE, LOCK, and UNLOCK, are inserted at compile time into the program's
source code. When a directive is executed by the CPU during execution time, it generates a request
to the OS to allocate X pages or to lock a particular page into memory. We have developed
algorithms for inserting directives automatically at compile time. These algorithms utilize source
level code information to identify program localities and to evaluate the size of these localities.

We have also treated the problem of subprogram control sequence using the activation record
technique. A subprogram may be defined as a subroutine, a function, or a procedure. When a
subprogram is called, program locality structures are redefined according to the localities present
in the newly called subprogram. Therefore, the memory requirements of a program are also redefined
by the recently called subprogram. However, when a subprogram returns, the memory requirements of
the main program have to be restored. This problem is handled by creating a new activation record
for each subprogram whenever it is called. The activation record contains the most recent
information generated by memory directives. In particular, each activation record contains the
values of the (P, X) pair, where X is the memory allocation request and P is the priority of
allocation. Each activation record has a pointer to the previous one, thus forming a dynamic chain
connecting all activation records in the order of their creation.
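
The dynamic chain of activation records can be illustrated with a short sketch. It assumes a simple linked stack; the class names, the field names, and the subprogram name SUB are hypothetical and are not taken from the implementation described in this thesis.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ActivationRecord:
    name: str                                      # subprogram name (illustrative)
    request: Optional[Tuple[int, int]] = None      # most recent (P, X) pair, if any
    previous: Optional["ActivationRecord"] = None  # link to the caller's record


class CentralStack:
    """Chain of activation records in order of creation (assumed layout)."""

    def __init__(self) -> None:
        self.top: Optional[ActivationRecord] = None

    def call(self, name: str) -> None:
        # A new record is created whenever a subprogram is called.
        self.top = ActivationRecord(name=name, previous=self.top)

    def directive(self, priority: int, pages: int) -> None:
        # A directive overwrites the (P, X) pair of the current record.
        assert self.top is not None, "no active subprogram"
        self.top.request = (priority, pages)

    def ret(self) -> None:
        # Returning restores the caller's record and its memory requirements.
        assert self.top is not None, "no active subprogram"
        self.top = self.top.previous


stack = CentralStack()
stack.call("MAIN")
stack.directive(2, 60)
stack.call("SUB")
stack.directive(1, 4)
stack.ret()
print(stack.top.name, stack.top.request)  # MAIN (2, 60)
```

After the return, the requirements of the caller are available again without re-executing any directive.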

The cost of executing memory directives has also been discussed in this chapter. A variation
of the ALLOCATE directive structure may be used to reduce the frequency of executing a directive.
The compiler directed policy can be implemented in such a way that a directive does not have to be
located inside a loop structure, where it would have to be executed several times.

The performance of CD in multiprogramming systems is of significant importance to this
study. The CD policy is designed to react to the constantly changing status of the free memory
available on the system due to multiprogramming. For this purpose, CD incorporates a swapping
mechanism. The swapping mechanism initiates a swapping process if the minimal memory requirement
of a running process exceeds the amount of free memory available on the system. Moreover, the
swapping mechanism incorporates a partial swapping strategy. Partial swapping allows a swapped out
process to maintain a resident set in memory. However, the resident set size of a swapped out
process is reduced to its minimal memory requirement, specified by the directive associated with
its lowest level locality.

The performance of CD in a multiprogramming system is evaluated in the next chapter. We
will examine the fault rate characteristics of CD among other performance measures. The usefulness
of partial swapping is investigated. Finally, we will compare the performance of CD with the
performance of WS in a multiprogramming system.

CHAPTER  4


PERFORMANCE  EVALUATION  AND  MEASUREMENTS 


4.1.  Introduction 

The importance of performance and its evaluation in all technical fields is obvious. Ferrari
[22] considers performance evaluation to be as indispensable for the viability of any technical
system as functionality and economicity. The previous chapter has addressed the other two
categories: the functionality and economicity of CD. The main goal of this chapter is to evaluate
the performance of CD in a multiprogramming system.

The term performance is understood in the context of the performance indexes used in this
study. The most common performance index of paging systems is the page fault characteristics.
The number of page faults, F, is a significant index by itself which serves as a measure of the
traffic between virtual and real memory. It also reflects the lifetime of a process: the lifetime
of a process is inversely proportional to F. A process's lifetime is commonly used to model program
behavior [13], [19]. Also, F can be used as a measure of a process's virtual turn around time. A
virtual turn around time ignores the time spent in queues waiting for other processes to be served.
However, the virtual turn around time differs from that obtained in a uniprogramming environment.
This difference results from the swapping activity, which is a characteristic of multiprogramming
systems only. The virtual time, VT, of a process is given by

$$ VT = T + F \times L $$

where T is the length of the reference address trace (each memory reference is one time unit), F is
the number of page faults generated during the execution time of a process, and L is the time
needed to service a page fault. L includes the time needed to interrupt the CPU and to transfer
control to a paging device, the seek time needed to locate a missing page in the virtual storage,
and the time to transfer a page from disk (the virtual storage) to main memory. The real turn
around time, RT, of a process includes the waiting time in system queues:

$$ RT = T + F \times L + Q $$

where Q is the time a process spends in the system's queues waiting for service. In this study we
find the number of page faults for each process in the system, F_p, and for all N processes in the
system, F_sys, where

$$ F_{sys} = \sum_{p=1}^{N} F_p $$
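
As a worked illustration of these indexes, the following sketch evaluates VT, RT, and the system fault count. The function names and all numeric values are hypothetical and chosen only for the example; they are not measurements from this study.

```python
from typing import Iterable


def virtual_time(T: int, F: int, L: int) -> int:
    """VT = T + F * L: one time unit per reference plus fault service time."""
    return T + F * L


def real_time(T: int, F: int, L: int, Q: int) -> int:
    """RT = T + F * L + Q: virtual time plus time spent waiting in queues."""
    return virtual_time(T, F, L) + Q


def system_faults(faults_per_process: Iterable[int]) -> int:
    """F_sys: the sum of the per-process fault counts F_p."""
    return sum(faults_per_process)


# Hypothetical values: a 10,000-reference trace, 50 faults, a fault service
# time of 100 time units, and 2,500 time units spent in queues.
print(virtual_time(T=10_000, F=50, L=100))       # 15000
print(real_time(T=10_000, F=50, L=100, Q=2_500))  # 17500
print(system_faults([50, 120, 30]))               # 200
```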

The space time cost, ST, is another performance index commonly used to evaluate memory
management policies. ST is the time integral of the memory space occupied by a process.
Obviously, a process may occupy memory space while it is running, or while in the process queue
(PQ) waiting for a time slot, or while in the fault queue (FQ) waiting for a page to be paged into
main memory. The real space time cost of a process is given by

$$ ST = \sum_{i=1}^{T} S_i + L \times \sum_{f=1}^{F} S_f + \sum_{j=1}^{Q} S_j $$

where S_i is the space occupied by a process at virtual time i; S_f is the space occupied by a
process during the service of a page fault; and S_j is the memory space occupied by a process while
waiting in the process queue for a CPU time quantum. Space time cost is a system performance index
rather than a process specific one. From the user's point of view, it is desirable to minimize the
running time of a process irrespective of the memory space it occupies during its execution.
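
The three contributions to the real space time cost (space held while running, while a fault is serviced, and while waiting in the process queue) can be sketched as below. The function name, the argument layout, and the sample values are assumptions made for illustration; in particular, the per-time-unit lists are one possible reading of the summations in the formula.

```python
from typing import Sequence


def space_time_cost(
    running_sizes: Sequence[int],  # S_i: pages held at each virtual time unit
    fault_sizes: Sequence[int],    # S_f: pages held during each page fault service
    queue_sizes: Sequence[int],    # S_j: pages held at each time unit spent in PQ
    L: int,                        # time needed to service one page fault
) -> int:
    """Sum the space held over running time, fault service time, and queue time."""
    return sum(running_sizes) + L * sum(fault_sizes) + sum(queue_sizes)


# Hypothetical process: 4 references, 2 faults (L = 10), 3 time units in PQ.
cost = space_time_cost(
    running_sizes=[3, 3, 4, 4],
    fault_sizes=[3, 4],
    queue_sizes=[4, 4, 4],
    L=10,
)
print(cost)  # 14 + 70 + 12 = 96
```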

Maximizing the throughput of the system is a desired goal from the system's point of view.
The throughput is the number of jobs completed per unit time. With the throughput in mind as a
performance index, the space time cost becomes an important criterion of performance. The results
of queuing network analysis claim that a maximum throughput can always be achieved if each
process in the system runs with a minimal space time cost [12], [20]. Theoretical support for this
claim is based on the assumption that memory capacity is completely utilized. Assume that the
total memory space is θ pages, and N processes are running for τ time units. The average space
time cost per process is given by:

$$ ST = \frac{\theta}{N} \times \tau $$

The system throughput, Θ, is given by

$$ \Theta = \frac{N}{\tau} \quad \text{or} \quad N = \tau \times \Theta, \quad \text{hence} \quad ST = \frac{\theta}{\Theta \times \tau} \times \tau . $$

The above formula implies that maximizing the system throughput, Θ, can be achieved by minimizing
the space time cost of every process in the system, or equivalently, minimizing the overall system
space time cost. The overall system space time cost, ST_sys, is given by

$$ ST_{sys} = \sum_{p=1}^{N} ST_p . $$

The empirical results presented in this chapter contradict the above conclusion. However, the space
time cost is still an important criterion of performance. Memory management policies have been
designed and proposed to optimize the space time cost of a running process; among these are WS
[18] and DMIN [10] (an optimal dynamic memory management policy).

The average memory size allocated to a process, or the average resident set size of a process,
V, is commonly used as a performance measure of memory management policies. V is useful in
studying the locality property of program behavior; it also helps evaluate the ability of a policy
to measure the memory demands of a program.

We have already mentioned that throughput, Θ, is used as a system performance index. A
system manager would like to increase the output of his system by maximizing the throughput.
However, this should not happen at the expense of slowing down the execution of some processes.
A tradeoff must be made between the interests of individual processes and the system as a whole,
e.g., minimizing a process's turn around time versus maximizing the system's throughput.

A multiprogramming specific performance index is the swapping rate. Program behavior in a
multiprogramming system is not a function only of the program's intrinsic properties; it also
depends on the behavior of other processes in the system. For global memory management policies, a
running process may replace the pages of any other process. For local dynamic memory management
policies, a running process may swap out of memory the resident set of any other process. The
swapping rate is defined in this thesis as the total number of a process's pages swapped out of
memory as a result of a swapping operation initiated by a running process. The swapping rate, Σ,
is a significant performance index by itself. Moreover, Σ has an impact on the page fault rate, as
we have discussed in Chapter 2. Also, Σ is responsible for the anomalous behavior of WS (see
Chapter 2).

In this chapter the performance measures discussed above are used to evaluate the performance
of CD, which will be compared with WS. The WS policy is chosen because it has been advocated
in the literature [20] as a near optimal policy. Besides, most of the dynamic, nonglobal policies
proposed to manage memory hierarchies are derivatives of WS. For example, the Damped
Working Sets (DWS) [36] modifies WS slightly to improve its behavior during interlocality
transitions. DWS outperforms WS by no more than 10% in terms of minimizing the space time cost
[20]. The Sampled Working Set (SWS) [34] has been proposed to reduce the implementation cost of
WS. Ferrari and Yih [23] proposed the Variable Interval Sampled Working Set policy (VSWS),
which combines the properties of SWS and DWS. The performance of VSWS is comparable to WS.
Global policies which are not WS descendants, such as global LRU and global CLOCK, have been
assumed to perform worse than WS [20]. The page fault frequency policy (PFF) [14] also achieves
performance similar to WS. Carr's proposed policy WSclock [13] is an approximation and global
implementation of WS; WSclock performs nearly the same as the pure WS.

It must be pointed out, however, that CD is not being compared with the optimal. But since
WS and its variations are considered to have near optimal performance, CD is compared with the
WS policy. This chapter presents compelling evidence that CD performs better than WS in many
aspects.

Before comparing CD with WS, the characteristics of CD are examined: namely, its dynamic
behavior, the partial swapping mechanism employed by CD, and the impact of the context switch
on CD's performance.

4.2.  Modeling  CD


The  multiprogramming  model  used  in  Chapter  2  for  evaluating  the  performance  of  WS  is 
used  for  modeling  CD  (Figure  4-1).  Following  is  a  description  of  CD's  implementation. 

Each  process  is  represented  by  its  virtual  address  trace.  A  trace  contains  both  virtual 
addresses  and  memory  directives.  The  directives  are  introduced  manually  into  the  source  level 
code. Each directive is represented by an integer number with a value larger than 1000. The most
significant digit is the priority index, P, of the directive and the rest of the digits constitute
the memory request, X. For example, a reference of the form 2120 is interpreted as a directive with
P=2 and X=120.
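
The encoding of a directive in the trace can be decoded as in the sketch below. The function name is hypothetical, and the sketch assumes that ordinary page references in the trace never exceed 1000, which is what allows the threshold to distinguish them from directives.

```python
from typing import Optional, Tuple


def decode_directive(entry: int) -> Optional[Tuple[int, int]]:
    """Split a trace entry into (P, X), or return None for an ordinary reference.

    Assumption: any entry larger than 1000 is a directive whose most
    significant digit is P and whose remaining digits form X.
    """
    if entry <= 1000:
        return None
    digits = str(entry)
    return int(digits[0]), int(digits[1:])


print(decode_directive(2120))  # (2, 120)
print(decode_directive(1002))  # (1, 2)
print(decode_directive(75))    # None
```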

Each  process  maintains  a  list  of  its  referenced  pages  in  the  main  memory.  The  memory  space 
reserved  for  a  process  is  determined  by  the  X  value  of  the  last  processed  directive.  All  paging 
activities  of  a  process  occur  within  its  specific  memory  area.  The  resident  set  of  a  process  grows  or 
shrinks upon processing a directive or as a result of a swapping operation.

Figure 4-1: Multiprogramming model

The CD policy does not have any control parameter. For this reason, CD exhibits no
controllability problems at run time. The directives are inserted at compile time, and evaluated
at run time in light of the status of the free memory. For each θ value, CD generates only one
value for each of the performance indexes described in Section 4.1, whereas WS needs to be tuned in
order to achieve a particular performance. The window size, τ, has to be properly chosen. Moreover,
different values of τ might be needed to optimize different performance criteria.

System parameters are used as control parameters in order to generate more results of CD and
to study the performance of CD in different environments. For example, a wide range of θ is used
to demonstrate the performance of CD in small and large memory systems. The value of θ is varied
from 6 to 500 pages. Also, the multiprogramming level, MPL, is varied between 3 and 10. Another
system variable is the context switch, CS. Several values of CS are used, ranging from 100 to 1000
time units.

The CD policy has the option of using or not using the partial swapping feature described in
Chapter 3. The partial swapping mechanism is implemented as follows. Each process keeps a record
of its current allocation and the memory request associated with P=1. The swapping mechanism
keeps a circular list of all the processes in the system and a pointer to the next candidate
process for swapping. Upon invoking the swapping mechanism, the processes are examined in turn,
searching for a process occupying memory with P>1. If such a process is found, its memory
allocation is reduced to that associated with P=1; this value is stored in the directive record
(used to store the values of a directive's parameters) of the process. The difference in memory
space between the old X and the new X is added to the free memory pool. If all the processes have
been forced to run with P=1 and the free memory pool is still too small to satisfy the current
memory request, a total swapping is applied, i.e., the entire resident set of a process is
pre-empted.
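
The mechanism just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulator used in this study: the class and field names are hypothetical, the requesting process is assumed to be exempt from swapping, and total swapping is assumed to pre-empt victims in the same circular order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Process:
    name: str
    priority: int   # P of the allocation currently held
    allocated: int  # pages currently held
    minimal: int    # memory request X associated with P = 1


class Swapper:
    def __init__(self, processes: List[Process], free_pages: int) -> None:
        self.processes = processes  # treated as a circular list
        self.free_pages = free_pages
        self.next_index = 0         # pointer to the next swapping candidate

    def _next_candidate(self) -> Process:
        victim = self.processes[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.processes)
        return victim

    def reclaim(self, requester: Process, needed: int) -> bool:
        """Free memory until `needed` pages are available; True on success."""
        # Partial swapping: demote processes holding memory with P > 1.
        for _ in range(len(self.processes)):
            if self.free_pages >= needed:
                break
            victim = self._next_candidate()
            if victim is requester or victim.priority <= 1:
                continue
            self.free_pages += victim.allocated - victim.minimal
            victim.allocated = victim.minimal
            victim.priority = 1
        # Total swapping: pre-empt entire resident sets if still short.
        for _ in range(len(self.processes)):
            if self.free_pages >= needed:
                break
            victim = self._next_candidate()
            if victim is requester:
                continue
            self.free_pages += victim.allocated
            victim.allocated = 0
        return self.free_pages >= needed


a = Process("A", priority=2, allocated=60, minimal=2)
b = Process("B", priority=2, allocated=17, minimal=1)
c = Process("C", priority=1, allocated=0, minimal=3)
swapper = Swapper([a, b, c], free_pages=5)
print(swapper.reclaim(c, needed=40), swapper.free_pages)  # True 63
print(a.allocated, b.allocated)                           # 2 17
```

In this run only process A is demoted, because reducing it to its P=1 request already frees enough pages; process B keeps its full allocation.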

When a process gains control of the CPU, it is assigned a memory space according to the values
found in its directive record. Initially, the directive record contains the value of the minimal
memory space a process is entitled to have. In this model, this value is equal to 0, i.e., the
resident set of each process is initially empty. When a process is removed from the control of the
CPU for a time out interrupt, or for a page fault service, its directive parameters are remembered
in the directive record. The CD policy does not keep a record of the members of a process's
resident set at the time of relinquishing the CPU. Upon regaining control of the CPU, a process
under CD demands its resident set's pages back into memory; WS is implemented in a similar manner.


4.3.  CD  Characteristics 


4.3.1.  Dynamic  memory  allocation 

The amount of physical memory allocated to a program varies during execution for two reasons.
The first one is attributed to a program's intrinsic locality characteristics. A transition from
one locality structure to another results in a change in the memory requirement of a program.
Therefore, the amount of memory allocated to a process may vary every time a new directive is
executed and its request is satisfied. Variable memory allocation also occurs within a locality
structure. The amount of memory allocated to a process may be reduced due to a partial swapping
operation. This happens when a process is occupying memory with P>1, i.e., low priority. Also, a
process may switch its memory allocation from that required by a lower level locality to a larger
one requested by a higher level locality; this is viable because the size of free memory may be
increased if a process releases some memory or a process completes its execution.

The variation in the memory size allocated to a process is not expected, however, to be abrupt.
A process is expected to spend some time inside a locality; therefore, directives are spread apart
by the duration of a locality. Once a directive is processed and a particular request is satisfied,
the size of free memory is not expected to change until another process executes and changes its
memory requirements. For both these reasons, no abrupt change is expected in the amount of memory
allocated to a process over execution time. In this section we report some results about the
dynamic memory allocation obtained under CD.

In Figure 4-2, the memory space allocated to a process is plotted versus real time. Five plots
are shown in the figure, one for each of the programs. The plots are generated for MPL=3 and three
values of θ (θ=50, 100, 300). Consider, for example, program MAIN. For θ=50, all memory
requests larger than 50 pages are not satisfied. Memory allocation varies between 1 page and 17
pages according to the directive:

ALLOCATE (2,17) else (1,1)

The memory request, X=60, generated by the directive

ALLOCATE (2,60) else (1,2)

cannot be satisfied when θ=50. However, 60 pages can be allocated when θ=100. Depending on the
size of the free memory pool, memory allocation varies between 1, 2, 4, and 60 pages; this
variation represents intra-locality transition. An example of transition from one locality to
another (inter-locality transition) is illustrated in the time region between t=1.28x10* and
t=1.31x10*, where the amount of memory allocated to program MAIN changes from 60 to 3 pages. With
larger values of θ, memory allocation within a locality structure seems to be stationary; the
first request of a directive (and the largest) is allocated most of the time. For program MAIN, 60
pages are always allocated whenever a directive of the form ALLOCATE (2,60) else ... is executed.
Similar observations can be made from the analysis of the rest of the figures.

Compared to other dynamic policies, such as WS and global algorithms, CD does not exhibit
high implementation costs. A program's resident set does not have to be updated or computed at
every memory reference time. The number of times a resident set has to be updated is limited by
the frequency of generating directive requests. The plots in Figure 4-2 show that for the five
programs, memory allocation does not change in a rapid continuous fashion; it is rather discrete
and widely spread over time.

Figure 4-2: Dynamic memory allocation under CD

4.3.2.  Partial  swapping

A major characteristic of CD is that it prohibits any program from running unless there is
enough memory space to allocate at least one level of its current locality structure. The CD policy
incorporates a swapping mechanism to facilitate this feature. The swapping mechanism has been
discussed in Chapter 3. Partial swapping is introduced in order to give a process a chance to keep
some pages in memory before it is completely demoted. Hopefully, with partial swapping more
processes can share the memory and CPU resources at no risk of thrashing. One may argue that
partial swapping may increase the number of processes in the system at the expense of generating
more page faults. In a multiprogramming environment, it is not easy to agree or disagree with this
argument on a purely theoretical basis. On one hand, it might be true that a smaller memory
allocation results in a higher fault rate, or at least should. On the other hand, the lowest level
locality might have a time duration even longer than the context switch time interval given to a
process. In this case partial swapping can have only a positive impact on the performance of an
individual process and of the system. Besides, total swapping may result in extra, completely
unutilized memory space.

The empirical results reported in this section demonstrate the impact of using a partial
swapping strategy along with total swapping. For MPL=10 and θ=50, the total number of system page
faults is 39015 with partial swapping turned on; without partial swapping the page faults
increased to 53220, a difference of 12105 faults. However, for larger values of θ the big
difference disappears. For example, the number of page faults for θ=150 with partial swapping on
is only 3 faults less than the number of faults without partial swapping. For θ=200, 300, and 400
the difference disappears completely.

For each process in the system, the total number of pages, Σ, swapped out from the resident
set of the process is recorded. Figure 4-3 presents a plot of Σ versus θ for programs MAIN, INIT,
and HWSCRT with MPL=5.

Our results show that partial swapping can indeed result in more page faults. However, for
high memory contention cases, partial swapping tends to generate fewer faults than total
swapping.

4.3.3.  Effect  of  context  switch 

One reason for a process to relinquish the CPU is that it uses up its context switch interval, at
which point a time out interrupt is generated. A process may actually use all of its time slot if
it is not interrupted during the context switch period. In our model, the only other interrupt is
caused by a page fault. Therefore, if the average lifetime of a process, i.e., the time between
successive page faults, is larger than the context switch value, CS, a process is likely to use up
its time slot before a page fault occurs. Similarly, if the time between successive faults is
shorter than CS, a process is likely to lose control of the CPU before its context time runs out.

The average lifetime of a process is inversely proportional to the fault rate. The average
virtual lifetime, G, of a process is defined as G = T/F, where T is the length of the address
reference string, and F is the number of page faults. The average virtual lifetime is maximum if F
is minimum. The minimal number of faults in a demand paging system is equal to the number of
pages in the virtual space of a process. The maximum value of G for the programs used in our
experiments is given in Table 4-1. Averaging over all of the programs, the maximum average
lifetime is 365. In reality the lifetime of each process is lower than the values given in Table
4-1 because the actual number of faults is much higher than the absolute minimum.
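
As a small worked example of the definition G = T/F, the sketch below recomputes the maximum lifetime of program HWSCRT from the reference length and virtual size listed in Table 4-1. The function name is hypothetical, and rounding to the nearest integer is an assumption about how the tabulated value was obtained.

```python
def average_virtual_lifetime(T: int, F: int) -> float:
    """G = T / F: average number of references between successive page faults."""
    return T / F


# HWSCRT in Table 4-1: 22,721 references and a virtual size of 76 pages.
# With demand paging, the minimal number of faults equals the virtual size.
g_max = average_virtual_lifetime(T=22_721, F=76)
print(round(g_max))  # 299
```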

All of the results reported in this thesis use CS=1000. This value is large enough to exclude
the impact of CS on the results, since we are mainly interested in paging related characteristics
of program behavior. However, we report in this section more results using smaller values of CS.
The fault rate characteristics are observed under different values of CS.

In Figure 4-4 we plot the fault rate versus θ for each program (MPL=5). Two values of CS
are used, CS=100 (solid line) and CS=1000 (dotted). From the curves in Figure 4-4 we note that
most of the programs favor larger values of CS. Program MAIN, with Gmax=1017, generates
significantly fewer faults with CS=1000 than with CS=100, especially with θ in the range 75 to 100
pages. For example, F(θ=80, CS=100)=5524 and F(θ=80, CS=1000)=852, a difference of 4672
faults. For large θ values, the difference in page faults for different CS values disappears
since, for large θ values, the number of faults is considerably low, and a process is allowed to
use all of its time slice. Moreover, the swapping rate is considerably lower with large values of
θ, and therefore a process is likely to retain its resident set pages when it regains control of
the CPU during the next context switch.


Table 4-1
Maximum lifetimes of programs

Program    Ref. Length (T)    Virtual Size    G max
MAIN       79,325             75              1017
FIELD      10,523             60              LSI
INIT       10,745             174             62
CONDUCT    52,452             29)             :s3
HWSCRT     22,721             76              299

Figure 4-4: Effect of context switch on page faults (panels include 4-4d: CONDUCT and
4-4e: HWSCRT; MPL=5; CS=100 solid, CS=1000 dotted)

System fault curves for three values of CS (100 solid, 1000 dotted, 2000 dashed) are shown in Figures 4-5a and 4-5b for MPL=5 and 10, respectively. The curves are almost identical for θ>100 and θ>300 for MPL=5 and MPL=10, respectively. For smaller θ values, smaller CS values generate a larger number of faults. For small θ values, the swapping activity is considerable and, therefore, it is possible that a process is swapped out before its next time slot. Using a relatively large CS allows a process to benefit from those pages it has paged into its resident set.

4-5b: SYSTEM, MPL=10, CS=100 (solid), CS=1000 (dotted), CS=2000 (dashed)

Figure 4-5: Effect of context switch on page faults

However, using a large CS value affects the response time because a process has to wait too long in the process queue before its next scheduling time. Small CS values, as discussed above, tend to generate more faults and consequently increase the turnaround time of a process. There is a tradeoff between response time and turnaround time. Response time, however, has to be acceptable to human norms. Therefore, a maximum response time can be enforced by using a global context switch, g. The distribution of g among the processes depends on the lifetime of a process and the number of processes in the system. The general criterion is that a process should be allowed to continue using the CPU as long as it does not generate a reference to a non-resident page, i.e., a page fault. However, the smooth behavior of a process should not be a reason to keep other processes waiting in the queue; after all, these processes may have smooth behavior as well. Therefore, a process should be pre-empted from the CPU if it exceeds a threshold value. Following is a dynamic strategy for allocating time quanta to running processes.

Let g be a global context switch; g is set to a maximum value m. Also, let N be the number of processes which have not yet been scheduled to run during one scheduling cycle; a scheduling cycle is completed when all the processes in the system have used the CPU once for some time. Define a threshold, h, as h = g/N; g is always evenly distributed among the remaining processes in the system. Every time a process leaves the CPU after some time t, g is updated as g = g − t and N is updated as N = N − 1. The time a process spends using the CPU is determined by an interrupt due to a page fault or a time-out interrupt after h time units, whichever occurs first. We further illustrate this strategy using an example.

Example 4-1:

Assume that there are 4 processes in the system. Let m=1000 time units; i.e., the maximum response time for any process is 1000. Initially, g=1000 and h=1000/4=250. Let process P1 run until a fault occurs after 100 time units, t=100; i.e., P1 does not use all of the time it is entitled to (h=250). At this point g is updated as g=1000−100=900, and g is distributed among three processes since N=4−1=3 (h=900/3=300). Note at this point that the remaining processes in the system have a higher threshold than did P1 when it controlled the CPU. Next P2 runs and uses up all of its time quantum (300 units) before it generates a fault. All parameters are updated as g=900−300=600, N=3−1=2, and h=600/2=300. Assume that P3 executes until a page fault occurs after 150 time units. The value of g now becomes g=600−150=450, N=2−1=1, and h=450. Process P4 can use the CPU for 450 time units unless it generates a page fault; assume that a fault occurs after 400 time units. Now g is reset to 1000 and a new cycle begins. Note that no process in the system may wait in the queue more than 1000 time units, and each process is allocated at least 250 time units.
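The bookkeeping in this example can be traced mechanically. The following is a minimal Python sketch of one scheduling cycle, not the simulator used in this study; the function name `run_cycle` and the use of `math.inf` to mean "would not fault before the time-out" are assumptions made for illustration. With m=1000 and four processes the first threshold is m/N = 1000/4 = 250.

```python
import math

def run_cycle(m, time_to_fault):
    """Simulate one scheduling cycle of the dynamic quantum strategy.

    m             -- maximum value of the global context switch g
    time_to_fault -- for each process, in scheduling order, how long it
                     would run before its next page fault (math.inf if it
                     would not fault before being timed out)
    Returns a list of (threshold h, CPU time used t) pairs.
    """
    g = m
    n = len(time_to_fault)
    trace = []
    for fault_after in time_to_fault:
        h = g / n                  # g is split evenly among processes still waiting
        t = min(fault_after, h)    # page fault or time-out, whichever comes first
        trace.append((h, t))
        g -= t                     # g = g - t
        n -= 1                     # N = N - 1
    return trace

# P1 faults after 100, P2 runs to its time-out, P3 faults after 150, P4 after 400.
trace = run_cycle(1000, [100, math.inf, 150, 400])
for i, (h, t) in enumerate(trace, start=1):
    print(f"P{i}: threshold h = {h:g}, ran for t = {t:g}")
# P1: threshold h = 250, ran for t = 100
# P2: threshold h = 300, ran for t = 300
# P3: threshold h = 300, ran for t = 150
# P4: threshold h = 450, ran for t = 400
```

Because each threshold is computed from whatever is left of g, the CPU times in one cycle can never add up to more than m, which is what bounds the response time.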

The above scheme allows smoothly behaving processes (with a low fault rate) to take advantage of the short lifetime of heavily faulting processes. At the same time heavily faulting processes are not punished for bad behavior: a heavily faulting process is scheduled to run after at most m time units. In the above example, P2 could use the CPU for 300 time units because P1 did not use all of its time. Similarly, P4 could use the CPU for 400 time units because P3 left the CPU before its time had expired. However, P1 is rescheduled after 1000 time units from the time it first controlled the CPU. Using static CS distribution, P1 could have been scheduled after 650 time units. Of course this is a shorter response time, but it makes little difference if m is chosen within the range of human acceptable reaction (a few milliseconds, for example) for interactive systems. Moreover, under static distribution two processes (P2 and P4) would have been interrupted although they could have used the CPU for useful work.

The notion of response time is mostly applicable to interactive systems. In batch processing systems, response time has little significance. Therefore, for batch scheduled jobs, it is more effective if a process is allowed to execute until it generates a page fault.

Dynamic time allocation remains to be investigated further. One way to pursue this issue is to look into the possibility of using the memory directives introduced in this thesis, or possibly some time directives, to guide a dynamic time allocation strategy. In this thesis we investigated only static time allocation.

4.4. CD Versus WS


Simulation is performed for several values of θ ranging from θ = 6 to 200 pages. Small values of θ represent the case of high memory contention characterized by a relatively high rate of swapping. Larger values of θ are used to evaluate the performance of CD and WS when there is enough memory to allocate the resident sets of programs as requested by CD or defined by τ, the WS parameter. Four levels of multiprogramming (MPL) are used: MPL = 3, 4, 5 and 10. MPL=10 is achieved by running two copies of each program simultaneously. For MPL=3, high memory contention results for θ < 30 pages. For MPL=10, memory contention is observed for θ < 150 pages.

Next CD is compared with WS in terms of the page faults, the space time cost, the system throughput, and controllability.

4.4.1. Page faults

Minimizing the turnaround time of a job is a primary performance objective from the user's point of view. In a virtual memory system, this objective can be achieved by minimizing the page faults of a user's process. However, minimizing the faults of a process in the system may adversely affect other running processes' page faults and worsen the overall system performance. In the next subsection we study the page fault characteristics when the objective is to minimize the faults of individual processes.

4.4.1.1. Page faults of individual processes

In a uniprogrammed system, WS can be easily tuned to achieve the absolute minimum number of page faults by choosing a relatively large value for τ. Earlier experiments [20], [3] have always assumed a uniprogrammed system with infinite memory where the τ value is not restricted by the memory size. However, in practice τ is restricted by the finite memory capacity available on the system. In a multiprogramming system the page faults of a process are affected by other processes running in the system; therefore, large τ values may not always generate a low number of faults. In Chapter 2, it was shown that increasing τ may result in increasing the number of faults, i.e., anomalous behavior. Also, larger τ values yield large working set sizes which lead to a memory contention problem among processes in the system.


Table 4-2a (MAIN)

CD compared with the minimal achievable page faults under WS
with corresponding space time costs

          Page Faults            ST Cost (10^6)
  θ      CD     WS   ΔF %      CD      WS   ΔST %   τ (WS)
  6     923   1743     89    3.84    9.07
 10     923    978      6    3.84              50   6
 20     872    921      6                      51   6
 25     855                          5.71      32   6
 50                                           424   6900
100     169    302     79    5.22                   9900

Table 4-2b (FIELD)

CD compared with minimal achievable page faults under WS
with corresponding space time costs

          Page Faults            ST Cost (10^6)
  θ      CD     WS   ΔF %      CD      WS   ΔST %   τ (WS)
  6     173   7903   4468   2.858    31.6    1005   1
  7     173   3795   2094   2.858    33.1    1058   6
  8     173   3172   1733   2.858    27.2     862   386-436
  9     173   2892   1572   2.858    26.2     816   441-∞
 10     173   3357   1840   2.858    34.1    1093   6
 11     173   2784   1509   2.858    30.8     978   381
 12     136   1307    861   2.899    11.9     310   261
 13     136   1217    795   2.899    12.3     324   396-436
 14     136   1153    748   2.899    12.1     317   771-1101
 15     136   1163    755   2.899    12.3     324   221-226, 451-511
 16     136   1146    743   2.899    13.1     352   386
 17     136   1101    710   2.899    13.5     366   581-761
 18     136    341    151   2.899    3.91      35   261
 20     136    219     61   2.899     3.1       7   381-396
 25     143    149      4   2.768    2.98       8   1301-1501
 30     136    134      0   2.899    2.62      -9   1601-
 35     136    104    -23   2.899    3.03       5   1801-2301
 40     136    113    -17   2.899    2.92       1   4201-5401
 45     136    107    -21   2.899    3.04       5   921-961
 50     136    106    -22   2.899     3.2      10   6501-
100     136     69    -49   2.899    3.68      27   4701-6001


For CD, the number of faults is a function of θ only (F_CD(θ)), although θ is not a control parameter. In this study we use a wide range of θ values to demonstrate the ability of each policy to function in small and large memories. For each θ value CD generates one set of results including the number of faults for each process and for the system.

The WS policy is controlled by τ, the window size. Each performance index is a function of τ. For each θ and each τ, WS generates one set of results. Since we use several values of τ, several sets of results are obtained. The minimum τ value used is τ=1. An increment of 5 is used up to a value of τ=1000. A small increment is necessary to capture the behavior of WS in transitional periods. In numerical programs, changes in locality structures occur in abrupt fashion; this is obvious from the lifetime of individual numerical programs (see reference [8]). A larger increment is used for τ>1000. The WS window size is increased until the working set size of any process exceeds the amount of physical memory, θ, at which point an overload condition is raised; in this case the results are generated for all preceding τ values and the simulation is terminated. Simulation may be continued only with larger θ values.
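The sweep over τ can be sketched for a single reference string as follows. This is a simplified illustration under stated assumptions, not the multiprogrammed simulator used here: `ws_run` and `sweep_tau` are hypothetical names, a fault is counted whenever a page was not touched by the previous τ references, overload is approximated by the peak working set size exceeding θ, and the coarse step of 100 used beyond τ=1000 is an assumed value since the text only says the increment is larger.

```python
from collections import deque

def ws_run(refs, tau):
    """Run a reference string under a working-set window of tau references.

    Returns (faults, peak working-set size). A reference faults when its
    page was not touched by the previous tau references.
    """
    window = deque()   # the last tau references, oldest first
    count = {}         # page -> number of occurrences inside the window
    faults = 0
    peak = 0
    for page in refs:
        if page not in count:
            faults += 1
        window.append(page)
        count[page] = count.get(page, 0) + 1
        if len(window) > tau:
            old = window.popleft()
            count[old] -= 1
            if count[old] == 0:
                del count[old]
        peak = max(peak, len(count))
    return faults, peak

def sweep_tau(refs, theta, taus):
    """Try window sizes in increasing order and stop at the first overload.

    Returns (tau, faults) for the window with the fewest faults among the
    values tried before the working set outgrew theta, or None.
    """
    best = None
    for tau in taus:
        faults, peak = ws_run(refs, tau)
        if peak > theta:   # overload: working set exceeds physical memory
            break
        if best is None or faults < best[1]:
            best = (tau, faults)
    return best

# Fine grid 1, 6, 11, ..., 996; the coarse step of 100 above 1000 is an assumption.
TAUS = list(range(1, 1001, 5)) + list(range(1001, 3002, 100))

refs = [1, 2, 3] * 4   # toy reference string cycling over three pages
print(sweep_tau(refs, theta=3, taus=[1, 2, 3, 4]))   # (3, 3)
print(sweep_tau(refs, theta=2, taus=[1, 2, 3, 4]))   # (1, 12)
```

In the toy run, a window of three references is the smallest that holds the whole loop, so faults drop from 12 to 3; with θ=2 the sweep stops before reaching it, which mirrors how a small memory restricts the usable τ.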

Each τ value is used by all processes in the system (fully detuned policy [20]). Alternatively, one can use for each process in the system a separate τ which optimizes the performance of the particular process (fully tuned policy [20]). The high overhead associated with the fully tuned policy restricts its use. To achieve a performance close to that of fully tuned WS with relatively low overhead, we find for each process τ_F-min, which produces the minimal number of page faults (F_min(θ,τ)); recall that our objective is to minimize the turnaround time of individual processes. The side effects of operating with τ_F-min are measured by evaluating the corresponding space time costs (ST(τ)) and the average working set size W(τ).


Table 4-2c (INIT)

CD compared with the minimal achievable page faults under WS
with corresponding space time costs

          Page Faults            ST Cost (10^6)
  θ      CD     WS   ΔF %      CD      WS   ΔST %   τ (WS)
  6    2520   3686     46    13.8    24.5      78   196-376
  7    2520   3150     25    13.8    24.3      76   256
  8    2520   3038     21    13.8    31.7     130   6
  9    2520   2610      4    13.8    28.4     106   6
 10    2520   2556      2    13.8    28.7     108   6
 11    2520   2525      0    13.8    28.9     109   6
 12    2457   2525      3   13.19    29.0     120   6
 13    2457   2519      3   13.19    43.2     228   11
 14    2457   2515      2   13.19    44.3     236   11
 15    2457   2511      2   13.19    45.6     246   11
 16    2457   2513      2   13.19    46.0     249   11
 17    2457                 13.19    50.0     279   11-16
 18           2514      2   13.19    60.2     356   16
 20           2509      2   13.19    46.5     253   11
 25     945   2506    164    5.16    50.0     869   11-16
 30            978      3    5.16    35.6     590   81
 35     945    960      2   13.41    33.2     148   66-86
 40     945    947      0   13.41    48.7     263   121
 45     945    947      0   13.41    47.2     252   116
 50     369    947    157   11.22    61.0     444   156-161
100     273    175    -36   15.47    14.2      -8   516-1101


For MPL=3, the results are shown in Tables 4-2a, 4-2b, and 4-2c for programs MAIN, FIELD, and INIT, respectively. In the first column of each table is the memory size, θ. The number of page faults generated under CD and WS are given in the next columns; for WS the number of faults is the minimal value selected from several values generated under different τ values. The relative difference between F_CD and F_WS is given by

$$\Delta_F = \frac{F_{WS} - F_{CD}}{F_{CD}} \times 100\% \qquad (4\text{-}1)$$

Positive ΔF values indicate that the number of faults under WS is larger than that under CD. For the same θ, the space time costs under CD and WS are given in the next two columns. For CD this is the only value. For WS this is the space time cost achieved using τ_F-min. The relative difference between ST_WS and ST_CD is given by

$$\Delta_{ST} = \frac{ST_{WS} - ST_{CD}}{ST_{CD}} \times 100\% \qquad (4\text{-}2)$$

The last column shows the optimal τ for each process.
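As a quick check of how the percentage columns are formed, the relative difference of Equations (4-1) and (4-2) can be written as one small function. The sketch below is illustrative only; the helper name `relative_difference` is an assumption, and the two inputs are page fault counts read from Tables 4-2b and 4-2c.

```python
def relative_difference(ws, cd):
    """Percentage by which the WS value exceeds the CD value.

    Same form as Equations (4-1) and (4-2): (WS - CD) / CD * 100.
    A negative result means WS did better than CD.
    """
    return (ws - cd) / cd * 100.0

# FIELD, theta = 6: 7903 faults under WS against 173 under CD.
print(round(relative_difference(7903, 173)))   # 4468
# INIT, theta = 100: 175 faults under WS against 273 under CD.
print(round(relative_difference(175, 273)))    # -36
```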

The analysis of Tables 4-2 shows that CD performs better than WS in high memory contention cases (small θ values); high memory contention is characterized by high swapping activity. Consider, for example, θ = 8. The minimal faults under WS for programs MAIN, FIELD, and INIT are higher than those achieved under CD by 53%, 1733%, and 21%, respectively. Under CD, 4 swapping operations are performed to pre-empt 18 pages of memory, whereas under WS more than 40 swapping operations are initiated. The performance of WS improves when the memory available on the system is relatively large. For example, WS produces 36% and 49% fewer faults than CD for θ=100 for INIT and FIELD, respectively. CD still outperforms WS for program MAIN by 79%. For θ=100, the swapping rate is 0 under both CD and WS.


Table 4-3a (MAIN)

CD compared with the minimal achievable page faults under WS
with corresponding space time costs (MPL=4, 5, 10)

                Page Faults             ST Cost (10^6)
MPL    θ      CD     WS   ΔF %      CD      WS   ΔST %   τ (WS)
  4   10     923   1469     59   3.841    7.59      98   9
      20     923    947      3   3.841    5.87      53   10
      25     923    921      0   3.841    5.71      49   6
      30     889    921      4   4.367    5.81      33   10
      40     855    921      8   4.331    5.83      35   26
      50     855    921      8   4.331    5.83      35   50
     100     169    919    444   5.221    15.6     199   415
     150     152    258     70   7.117    23.4     229   6600
     200     152     79    -48   7.117    10.8      52   16,500-
  5   50     855   1157     35   4.331    11.9     175   51
     100     237    921    288    4.72    9.92     110   50-250
     150     169    310     83    5.21    22.9     340   6000
     200     152    139     -9   7.117    14.4     102   20,000-
 10   50     895   1029     20   4.334    6.11      41   11
     100     237    956    303    4.72    5.96      26   51
     150     245    564    130   8.113    22.2     174   6900
     200     152    474    212   7.117    23.1     225   7900

The improvement of WS with a relatively large memory size (θ=100 for MPL=3) is expected since the working set size of a program can grow with less restriction. Using large values of θ may result in a situation similar to a uniprogramming system with infinite memory, where WS can achieve the absolute minimal number of faults by using a relatively large τ. In a multiprogramming system it is always possible to transfer the system into a high memory contention state by increasing the number of processes competing for memory space and CPU time, i.e., increasing MPL. Comparing CD and WS for small θ values can be a useful measure of the optimal MPL supported by both policies. Consider, for example, the performance of CD and WS for θ=50. For MPL=3, WS generates fewer faults than CD does for programs MAIN and FIELD by 44% and 22%, respectively. When the multiprogramming level is increased to MPL=4, WS generates more faults than CD by 8% and 18% for MAIN and FIELD, respectively. Increasing MPL further to MPL=5 and 10, the number of faults under WS exceeds that under CD by 35% and 20% for MAIN, and by 1931% and 200% for FIELD, respectively. The results for MPL=4, 5, and 10 are reported in Tables 4-3, one table for each program.


Table 4-3b (FIELD)

CD compared with minimal achievable page faults under WS
with corresponding space time costs (MPL=4, 5, 10)

                Page Faults             ST Cost (10^6)
MPL    θ      CD     WS   ΔF %      CD      WS   ΔST %   τ (WS)
  4   10    2501   3909     56   8.946    35.5     275   9
      20     173   1806    944   2.858    21.7     659   15
      25     136    241     77   2.899    3.14       8   21
      30     136    193     42   2.751    2.81       2   30
      40     136    161     18   2.899    3.33      15   31
      50     136    161     18   2.899    3.55      22   45
     100     136    133      0   2.899    3.76      30   415
     150     136     64    -53   2.899    4.25      47   6600
     200     136     61    -55   2.899    4.09      41   10,500-
  5   50     136   2762   1931   2.899    38.1    1214   101
     100     136    109    -20   2.899    3.67      27   951
     150     136     68    -50   2.899     3.8      31   5500
     200     136     61    -55   2.899    4.09      41   15,000-
 10   50     135    405    200   2.762    4.82      43   21
     100     165    164      0   2.852    3.07       8   701
     150     128    114    -11   2.893    3.58      24   7000-
     200     128     83    -35   2.893    3.03       5   1800

From Tables 4-3 we note that CD benefits from increasing MPL for the same θ value, whereas the performance of WS degrades with increasing MPL. Consider, for example, program HWSCRT (Table 4-3e). Doubling MPL has almost no effect on the performance of CD, whereas the page faults under WS increased more than 12, 3, and 2 times for θ=100, 150, and 200, respectively.

The low number of page faults under WS, generated with larger θ values, is almost always associated with a space time cost (ST) larger than CD's; in other words, WS generates fewer faults at the expense of occupying more memory space for a longer time. Consider, for example, Table 4-2a for program MAIN, MPL=3, and θ=50. The WS policy generates 44% fewer faults than does CD (ΔF = −44%). However, the space time cost under WS is 4.24 times more than that under CD (ΔST = 424%). For program CONDUCT in Table 4-3d, WS's improvement over CD in terms of page faults for MPL=5 and θ=100, 150, 200 is accompanied by excess space time costs of 25%, 186%, and 247%, respectively. On the other hand, ST_CD is lower than ST_WS most of the time even when CD generates fewer faults than WS. For θ = 9, in Tables 4-2a, 4-2b, and 4-2c, CD generates fewer faults than WS by 16%, 1572% and 4% for MAIN, FIELD and INIT, respectively. For the same θ, CD outperforms WS in terms of ST by 59%, 816% and 106% for the same programs.

The analysis of Tables 4-2 and 4-3 shows that CD achieves better performance than WS in a small memory environment. WS is a better policy when using a large memory size. However, for the same memory size, CD can support higher multiprogramming levels. CD is designed to respond to

Table 4-3c (INIT)

CD compared with the minimal achievable page faults under WS
with corresponding space time costs (MPL=4, 5, 10)

          Page Faults
  θ      CD     WS   ΔF %
               3298     31
               2544      0
               2521      0
               1113    -55
                997    -61
                977     34
                274     -7
200     273    175    -36

changes in the memory status in a multiprogramming system. Both the hierarchical structure of memory directives and the partial swapping mechanism enhance the performance of CD.

In the above analysis we have assumed that each process can use its own optimal τ (fully tuned policy). The high overhead associated with this policy restricts its usage in real systems. Choosing one τ among the optimal ones (p% detuned policy [20]) may degrade the overall system performance. Moreover, an optimal τ for one process may not be usable by other processes. For example, the optimal τ for program MAIN (θ=45) is 6200. This τ cannot be used by INIT since the working set size (a function of τ) exceeds the available memory on the system: W(τ=6200, θ=45) = 69 pages > θ=45. In the next subsection we consider optimizing the overall system page fault performance.


Table 4-3d (CONDUCT)

CD  compared  with  the  minimal  achievable  page  faults  under  WS 
with  corresponding  space  time  costs  (MPL=4.  5.  10) 


ST  Cost(  10b) 


CD  WS  A,t  %  WS 


5043 

23005 

4788 

5507 

4634 

5148 

4456 

4952 

4125 

4876 

3873 

4637 

789 

2903 

356% 


15% 


96.96 


173.9 


268.0 


92.8 


5  50  4562 

100  789 

_ 150  748 

200 ~  748 

10  j  50  3873 

I  100  789 

I  150  748 

I  200  r48~ 


11%  236.0 


18%  284.9 


20%  328.4 


268%  I  301.7 


-46%  1  22~22~ 

31%  |  96=9= 

-04%  30. 1 7 

-22%  |  22.22 
-46%  1  ">7  ?? 


185.0  j 


203.0 


278.0 


421.0  i 


.6 

79.7  | 

160.0  | 

37.7  i 
63.5  ; 
77.1  : 


-29%  | 


-15% 


40% 


2 

259%  i 

AW  i 


15 


21 


60 


66 


116 


415 


6600 

20,000 

lof 

601 

6500 

50.000 


51%  ;  38.0 

?  1 8%  !  30. 1  7 


22.22 
v>  ■>■> 


j-  77.1  :  247%  j  50.000 

105.0  176%  i  21 

I  202.0  570%  i  451 

1  84. 8  i  282%  '  24.500 

35.3  59%  1000 


Table 4-3e (HWSCRT)

CD compared with minimal achievable page faults under WS
with corresponding space time costs (MPL=5, 10)

                Page Faults             ST Cost (10^6)
MPL    θ      CD     WS   ΔF %      CD      WS   ΔST %   τ (WS)
  5   50     649   4744    631   11.33    84.4     645   101
     100     646    378    -42   11.33    19.5      72   401
     150     646    155    -76   11.33    13.3      17   6500
     200     646    123    -81   11.33    9.43     -16   10,000
 10   50    4680   4684      0   11.33    82 o     632   71
     100     649   4580    606   11.33   188.0    1559   651
     150     646    474    -27   11.33    23.2     105   551
     200     646    340    -47   11.33    17.2      52   551

4.4.1.2. Overall system page faults


For WS we find one global τ which minimizes the overall system page faults. We then use this τ to find the corresponding page faults and space time costs of the individual processes. The results for MPL=3 are reported in Table 4-4. In Table 4-4 we compare the minimal overall system page faults, and the corresponding individual processes' page faults, under WS with the page faults achieved under CD. The space time costs of generating the given fault rate performance are also compared. From Table 4-4 it is easy to see that CD produces fewer faults than WS, irrespective of the maximum memory available on the system. However, the performance of CD is much better than that of WS when the memory contention is very high. For θ = 6, WS generates 164% more faults than CD does. For θ = 25 CD still outperforms WS by 85%. CD also outperforms WS on the individual processes level. For θ = 50, WS generates 8%, 21%, 157% and 50% more faults than CD does for programs MAIN, FIELD, INIT and the overall system, respectively.
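The difference between one global τ and a separate τ per process can be made concrete with a small selection routine. The sketch below is illustrative only: the function names `global_tau` and `tuned_taus` are assumptions, and the fault counts are hypothetical values invented for the example, not measurements from this study.

```python
def global_tau(faults):
    """Pick the single tau that minimizes total system faults.

    faults maps tau -> list of fault counts, one entry per process.
    Ties are broken in favor of the smaller window.
    """
    return min(faults, key=lambda tau: (sum(faults[tau]), tau))

def tuned_taus(faults):
    """Pick, for each process separately, the tau with the fewest faults."""
    n = len(next(iter(faults.values())))
    return [min(faults, key=lambda tau: (faults[tau][i], tau)) for i in range(n)]

# Hypothetical fault counts for three processes under three window sizes.
faults = {
    6:   [900, 400, 2600],
    11:  [880, 350, 2550],
    261: [700, 300, 2900],
}
print(global_tau(faults))    # 11
print(tuned_taus(faults))    # [261, 261, 11]
```

In this made-up case the window that is best for the system as a whole (τ=11) is not the one that two of the three processes would choose for themselves, which is why the per-process results under a global τ have to be reported separately.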

Table 4-4

Optimizing system performance, MPL=3

                  ΔF %                             ΔST %
  θ    MAIN  FIELD   INIT System     MAIN  FIELD   INIT System
  6      85   2464     35    164      113    733    136    215
  7      65   2094     23    133       96   1058    123    248
  8      45   1956     21    119       83   1048    130    248
  9      16   1864      4     95       59   1072    106    232
 10       6   1840      1     91       50   1093    108    235
 11       5   1836      0     89       48   1090    109    235
 12      24   2016      3     89       66   1356    120    362
 13      28   2110      3     91       72   1345    228    361
 14      17   2041      2     85       60   1363    236    367
 15       5   1930      2     77       53   1407    246    328
 16       4   1921      2     77       52   1407    249    380
 17      20   1163      2     52       64    869    340    365
 18       6   1151      2     48       54   1114    356    408
 20       7   1136      2     47       54   1176    363    384
 25       9   1076    166     85       33   1269   1111    762
 30       8     18      6      7       35     11    200     98
 35       8     18      6      7       35     14     16     31
 40       9     29      1      6       53     20    138    104
 45       8     18      2      5       52     24    124     95
 50       8     21    157     50       54     19    288    191
100     131    -48    -27     15      377     18     31    106

The results for MPL=4, 5 and 10 are reported in Table 4-5. For MPL=4, CD outperforms WS for θ<150. The improvement is higher for smaller θ values, e.g., 188% for θ=10. Similarly, for MPL=5, CD outperforms WS for θ<150; ΔF = 8% for θ=50 and 100. For MPL=10, WS generates 73%, 333%, and 38% more faults than CD for θ=50, 100, and 150, respectively. Note that when MPL is doubled (from 5 to 10) the improvement of CD over WS increases. The CD policy outperforms WS for MPL=5 and θ=50 by only 8%; however, a 73% improvement is achieved for MPL=10. Also, for MPL=5 and θ=150, WS generates fewer faults than CD by 36%; for the same θ value (150) and MPL=10 the number of faults under CD increased from 2048 to 4515, while the page faults under WS increased from 1303 to 6241, i.e., CD's faults increased by 2.2 times and WS's faults by 4.8 times. The outcome is a CD improvement of 38% over WS. For θ=200, WS's improvement over CD decreased from 49% for MPL=5 to less than 2% for MPL=10.
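The growth factors quoted for θ=150 follow directly from the four fault counts given in the text. As a worked check, using the form of Equation (4-1) for the last two lines:

```latex
\begin{align*}
\frac{F_{CD}(\mathrm{MPL}=10)}{F_{CD}(\mathrm{MPL}=5)} &= \frac{4515}{2048} \approx 2.2, &
\frac{F_{WS}(\mathrm{MPL}=10)}{F_{WS}(\mathrm{MPL}=5)} &= \frac{6241}{1303} \approx 4.8, \\
\Delta_F(\mathrm{MPL}=5) &= \frac{1303 - 2048}{2048} \times 100\% \approx -36\%, &
\Delta_F(\mathrm{MPL}=10) &= \frac{6241 - 4515}{4515} \times 100\% \approx 38\%.
\end{align*}
```

The sign of the relative difference flips because the WS fault count grows more than twice as fast as the CD fault count when MPL is doubled.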

As has been concluded from the analysis of individual processes, CD is more capable than WS of supporting a higher MPL for the same memory size. Recall that CD forces every process in the system to run with minimal memory allocation in high memory contention cases. A process running with a priority index P>1 for some MPL could be forced to run with P=1 (less allocation) for a higher MPL.

The second major column in Table 4-4 shows the excess space time cost that WS produces over CD for the overall system and the individual processes. The very large ST exhibited by WS does not reduce the fault rate of WS below that of CD. For θ = 25 WS produces 85% more faults than CD, and ST_WS is higher than ST_CD by 762%. Together with the results in the previous subsection, this observation suggests that CD makes better use of the allocated memory over execution time.


4.4.2.  Space  time  cost 

Minimizing the fault rate under WS by using large values of τ may produce high space time costs. Therefore, a more realistic cost measure of the WS policy is the space time cost. In fact, WS is advocated as a near optimal policy in terms of minimizing space time costs. Moreover, ST has been used to control the system throughput. A maximum throughput is claimed to be achieved when the


Table 4-5 (SYSTEM: MPL=4, 5, 10)

CD compared with the minimal achievable page faults under WS
with corresponding space time costs

              ST Cost (10^7)           Page Faults
MPL    θ      CD     WS   ΔST %      WS      CD   ΔF %   τ (WS)
  4   10   12.34            170   31681   10987    188   9
      50    4.77            562    6858    5643     22   116
     100    5.29            760    4268    1391    207   6200
     150    4.77             ma    1117    1309    -19   0200~
     200    4.77                    725    1309    -45   20,000
  5   50   12.05   30.4           11875    6931     72   31
     100    5.84   9.44            2489    2081     20   601
     150    5.09   11.8            1303    1972    -34   6500
     200    5.28   12.8             991    1955    -50   25,000
 10   50   13.62   49.1     260   29332   12174    141
     100   13.54   88.1     551   22870    4226    441
     150   11.13   27.5            6241    4080     53
     200   12.17   22.7            4055    3*^4      2

space time cost is minimized [12], [20]. In this subsection we compare the minimal space time costs achieved under WS with those achieved under CD for different values of θ.

The results for individual processes are reported in Tables 4-6a-e for MPL=4, 5 and 10. For
each process we find the window size τ_ST_min which minimizes the space time cost of that process.
The space time costs and the number of page faults generated using τ_ST_min are compared with the
space time costs and number of page faults generated under CD. The relative difference between
ST_WS and ST_CD is given by Δ_ST in Equation (4-2); Δ_F is given in Equation (4-1). Positive Δ_ST
and Δ_F mean that WS has a higher space time cost and generates more faults than CD. The value Δ_F
is used to study the time cost due to running each process at its minimal space time cost. A low
space time cost may result from using relatively small memory at the expense of generating many
faults.
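
As a minimal sketch of how the percentage columns of Tables 4-6 can be read, the relative differences may be computed as follows. The function name `relative_difference` is introduced here for illustration only, and the form (WS minus CD, divided by CD) is an assumption that is consistent with the tabulated values; the sample numbers are the first MPL=5 row of Table 4-6c.

```python
def relative_difference(ws_value: float, cd_value: float) -> float:
    """Percentage by which the WS value exceeds the CD value (negative if WS is lower)."""
    return (ws_value - cd_value) / cd_value * 100.0


# Sample row: ST costs in units of 10^6, followed by page fault counts.
delta_st = relative_difference(17.8, 5.04)   # space time cost, WS versus CD
delta_f = relative_difference(1016, 729)     # page faults, WS versus CD
print(round(delta_st), round(delta_f))
```

A positive result means WS is worse than CD on that measure, which matches the sign convention stated in the text.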


Table 4-6a (MAIN)

CD compared with the minimal achievable space time cost under WS
with corresponding page faults (MPL=4, 5, 10)

              ST Cost (10^6)           Page Faults
MPL    θ      CD     WS    Δ_ST %    CD     WS    Δ_F %     τ

50  j  4.33 


10O  I  4.72 


!  8. 


200  I  7.12  |  5.S6 


349  855 


249  I  237 


4.33  i  6.11 


3  824  !  5.96 


s  i i  s 


855  92 1 


237  I  921 


92 


152  i  921 


895  1029 


906  i  956 
Table 4-6b (FIELD)

CD compared with minimal achievable space time costs under WS
with corresponding page faults (MPL=4, 5, 10)

         ST Cost (10^6)           Page Faults
  θ      CD     WS    Δ_ST %    CD     WS    Δ_F %     τ

10 

8.95 

32.3 

4546 

261% 

5 

20 

2.85 

21.7 

660% 

173 

1806 

944% 

15 

25 

2.90 

3.14 

08% 

136 

241 

77% 

21 

30 

2.75 

2.81 

03% 

136 

193 

42% 

30 

40 

2.90 

3.11 

08% 

136 

161 

18% 

21 

50 

2.90 

3.03 

05% 

136 

166 

22% 

86 

100 

2.90 

3.07 

06% 

136 

161 

18% 

20 

150 

2.90 

3.19 

10% 

136 

72 

-47% 

24.000 

200 

2.90 

3.53 

22% 

97 

-29% 

1000 

3.01 

04% 

136 

214 

57% 

61 

100 

2.90 

3.14 

09% 

136 

168 

247c 

151 

150 

2.90 

3.42 

18% 

136 

72 

-477c 

3500 

200 

!  2.90 

3.52 

22% 

136 

97 

-29% 

951 

50  2.76 


100 


2.89 


200  2.89 


75% 

135 

405 

200% 

21 

087c 

165 

00 

700 

12% 

128 

167 

307c 

201 

05% 

128 

S3 

-35% 

1800 

Tables 4-6a-e show that CD has considerably lower space time cost than WS. Consider program
MAIN. For MPL=4, ST_WS is larger than ST_CD for all θ values except θ=150. However, for θ=150
WS generates 5 times more page faults than CD in order to achieve 18% less space time cost.
Similarly, for MPL=5, ST_WS is larger than ST_CD for θ=50 and 100. Note that the low space time cost
under CD is not achieved at the expense of a large number of page faults: for θ=100, ST_CD is 24%
less than ST_WS and F_CD is almost 3 times less than F_WS. A low ST cost under CD is due to a
relatively lower page fault number and a relatively lower memory consumption. In Table 4-6b, the
results are shown for program FIELD. The space time cost under CD, ST_CD, is lower than ST_WS for
all θ and MPL values. For θ=150 and 200, WS achieves a lower number of page faults than CD. For
such large values of θ, WS can use a large τ value to generate a minimum number of faults. CD,
however, achieves a minimum number of faults for much smaller θ values, e.g., θ=25 for MPL=4

Table 4-6c (INIT)

CD compared with minimal achievable space time costs under WS
with corresponding page faults (MPL=4, 5, 10)

                ST Cost (10^6)              Page Faults
MPL     θ      CD      WS    Δ_ST %      CD      WS    Δ_F %      τ
  5    50    5.04    17.8    253%       729    1016     39%      31
      100    9.28    10.7     15%       273     282      3%     601
      150    9.28    9.44      2%       273     178    -35%     501
      200    9.28    9.44      2%       273     178    -35%     501
 10    50   11.74    44.0    275%       535    2634    392%      11
      100    10.7    16.7     56%       273     523     92%     551
      150    3.15    9.43    199%       273     215    -21%     601
      200    9.33    9.33      0%       273     184    -33%     551

and θ=50 for MPL=5 and 10. Similarly, for program INIT, WS achieves a lower fault number than
CD while CD achieves a lower space time cost for θ=150 and 200 for MPL=5 and 10. Again, this is
because WS can generate close to the minimal page fault number by using a relatively large τ.
The virtual size of INIT is 175 pages; WS generates 178 page faults for τ≥500. The CD policy
achieves 273 faults at its best. However, CD still has a lower space time cost than the minimal
achievable under WS.




Tables 4-6 show that the space time cost of WS, when WS is properly tuned, is considerably
larger than the space time cost of CD most of the time. Even when WS has a lower space time
cost, its page fault number is higher than CD's and the low ST is mainly due to small memory
consumption. Our results show that WS is not optimal in terms of minimizing fault rate as
claimed in [20]. However, CD remains to be compared with DMIN to show how close to optimal a
space time cost it can generate.


In a multiprogramming system, minimizing the space time cost of individual processes may
not serve the purpose of optimizing the system performance. It would have been very helpful if the
processes in the system utilized one τ to achieve their minimal space time cost. Graham and
Denning [26] claim that all processes in the system can use one τ to achieve a space time cost within

Table 4-6d (CONDUCT)

CD compared with the minimal achievable space time cost under WS
with corresponding page faults (MPL=5, 10)

                ST Cost (10^6)              Page Faults
MPL     θ      CD      WS    Δ_ST %      CD      WS    Δ_F %      τ
  5    50    96.9   106.0      9%      4562    5144     13%      21
      100   30.17    37.7     25%       789     754     -4%     601
      150   22.22   30.17     36%       748     611    -18%     601
      200   22.22   30.17     36%       748     611    -18%     601
 10    50    38.0   106.0    179%      3873    5144     33%      21
      100   30.17    37.7     25%       789     754     -4%     601
      150   22.22   30.17     36%       748     611    -18%     601
      200   22.22   33.70     52%       748     679     -9%     801

Table 4-6e (HWSCRT)

CD compared with the minimal achievable space time cost under WS
with corresponding page faults (MPL=5, 10)

                ST Cost (10^6)              Page Faults
MPL     θ      CD      WS    Δ_ST %      CD      WS    Δ_F %       τ
  5    50   11.33   23.10    104%       649    5766    788%        1
      100   11.33   19.50     72%       646     378    -41%      401
      150   11.33    13.3     18%       646     155    -76%     6500
      200   11.33    13.5     19%       646     188    -70%   30,000
 10    50   11.33    23.1    104%       649    5766    788%        1
      100   19.23    23.1     20%       649    5766    788%        1
      150   19.28    20.8      8%       646     486    -25%      401
      200   19.28    15.1    -22%       646     347    -46%      451


10% of the minimal space time cost (10% detuned policy). The goal is, therefore, to minimize the
overall system space time cost, assuming that individual processes are within P% of their minimal
ST values. In Table 4-7 the minimal system space time cost under WS, ST_WS, is compared with
ST_CD.

For WS, we find a window size, τ_sys_ST_min, which minimizes the overall system space time
cost, ST_sys, which is compared with ST_CD. The number of page faults generated using
τ_sys_ST_min is also found and compared with F_CD. Table 4-7 shows that CD outperforms WS by a
Table 4-7 (SYSTEM)

CD compared with minimal achievable space time costs under WS
with corresponding page faults (MPL=4, 5, 10)


great margin, especially for θ=10, 50, and 100. Note that the improvement of CD over WS
increases with increasing MPL for the same θ values. For instance, for θ=150, CD outperforms WS
by 31%, 58%, and 147% for MPL=4, 5, and 10, respectively. The CD policy achieves lower fault
numbers than WS for all θ and MPL values without exception. The negative Δ_ST and Δ_F values in
Tables 4-6 disappear in Table 4-7, indicating that a process may have a lower space time cost under
WS than under CD only at the expense of some other process in the system.
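
The search for a single system-wide window size, and the P% detuning check described above, can be sketched as follows. This is an illustration under stated assumptions, not the simulator used in this study: the system ST is taken to be the sum of the per-process ST values, the names `best_system_tau` and `within_detuning` are hypothetical, and the ST curves are invented numbers.

```python
from __future__ import annotations


def best_system_tau(st_by_process: dict[str, dict[int, float]]) -> int:
    """Return the window size that minimizes the summed ST over all processes."""
    # Only window sizes measured for every process can be compared.
    common = set.intersection(*(set(curve) for curve in st_by_process.values()))
    return min(common, key=lambda tau: sum(c[tau] for c in st_by_process.values()))


def within_detuning(st_by_process: dict[str, dict[int, float]],
                    tau: int, p_percent: float) -> bool:
    """True if every process run at `tau` is within p_percent of its own minimal ST."""
    limit = 1.0 + p_percent / 100.0
    return all(curve[tau] <= min(curve.values()) * limit
               for curve in st_by_process.values())


# Invented ST curves (window size -> ST) for two hypothetical processes.
curves = {
    "A": {10: 5.0, 100: 6.0, 1000: 9.0},
    "B": {10: 20.0, 100: 8.0, 1000: 7.5},
}
tau = best_system_tau(curves)
print(tau, within_detuning(curves, tau, 10), within_detuning(curves, tau, 25))
```

In this invented example the system-optimal window leaves process A more than 10% above its own minimum, which is the kind of situation the negative entries of Tables 4-6 hint at.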

The corresponding STs and page faults for individual programs are found when the overall
system ST is minimized, using τ_sys_ST_min. These values are compared with ST_CD and F_CD for
individual processes. The results are reported in Table 4-8 for MPL=3. Table 4-8 shows that CD
outperforms WS at the individual process level when the overall system performance is being
optimized.

Table 4-8

Optimizing system performance, MPL=3

                              Δ_F %                                   Δ_ST %
          θ     MAIN        FIELD         INIT   System     MAIN    FIELD     INIT   System
          6       85         2464           35      164      113      733      136      215
          7       65         2094           23      133       96     1058      123      248
          8       45         1956           21      119       83     1048      130      248
          9       16         1864            4       95       59     1072      106      232
         10        6         1840            1       91       50     1093      108      235
         11        5         1836            0       89       48     1090      109      235
         12       24         2016            3       89       66     1356      120      362
         13       28         2110            3       91       72     1345      228      361
         14       17         2041            2       85       60     1363      236      367
         15        5         1930            2       77       53     1407      246      328
         16        4  [illegible]            2       77       52     1407      249      380
[illegible]       20         1163            2       52       64      869      340      365
[illegible]        6         1151            2       48       54     1114      356      408
         20        7         1136            2       47       54     1176      363      384
         25        9          166  [illegible]       85       33     1269     1111      762
         30        8           18            6        7       35       11      200       98
         35        8           18            6        7       35       14       16       31
[illegible]        9           29            1        6       53       20      138      104
[illegible]        8           18            2        5       52       24      124       95
[illegible]        8           21          157       50       54       19      288      191
        100      131          -48          -27       15      377       18       31      106


4.4.3.  System  throughput 

A major design goal in a multiprogramming system is to maximize the number of jobs completed
per unit time, i.e., the system throughput (φ). In Table 4-9, the maximum throughput achieved
under WS (φ_WS) is compared with the throughput under CD (φ_CD) for MPL=3, 5, 10. The relative
difference (Δ_φ) between WS's maximum throughput and CD's throughput is given by

$$ \Delta_\phi = \frac{\phi_{CD} - \phi_{WS}}{\phi_{WS}} \times 100\% \qquad (4\text{-}3) $$


Table 4-9 shows that CD outperforms WS by a large margin, especially for smaller values of θ.
Consider, for example, MPL=3. For θ=6, CD has a higher throughput than WS by a factor of
approximately 15. For θ=100, CD achieves a 13% higher throughput than WS. For MPL=10, CD achieves
higher throughput than WS by 87% and 27% for θ=100 and 150, respectively. The results suggest that
CD outperforms WS when the memory is highly utilized.


Table  4-9 

CD  compared  with  the  maximum  achievable  throughput  under  WS 


4.4.4.  Controllability 


In a multiprogramming system it is necessary to tune the WS policy in order to find a suitable τ
to achieve a desired goal. For CD this problem does not exist, since the directives are inserted at
compile time and executed as part of the code at run time. Memory allocation is performed
dynamically as the directives are received by the operating system. Finding the appropriate τ in
the WS case can be a tedious problem for several reasons.

The first reason is the anomalous behavior of WS's fault rate function discussed in Chapter 2.
With the existence of anomalies, fault rate reduction is not always achievable by increasing τ.
Instead, the fault rate may increase. Moreover, the fault rate anomalies distort the shape of fault
rate function curves and, hence, the lifetime curves. Because of the anomalies, lifetime curves do
not exhibit well defined knees; knees in a lifetime curve are essential for the primary knee criterion
[20]. The primary knee criterion suggests that the primary knee of a lifetime curve is
approximately associated with the minimum space time cost point; i.e., by using the τ where the
primary knee occurs, a process would be running with minimal space time cost. For this reason,
Denning rejects lifetime models of program behavior if they do not exhibit knees [20]. With the
existence of τ-F anomalies, it is not obvious how one would locate the knees of a lifetime curve.
The primary knee criterion, therefore, may not be useful for controlling WS.
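
To make the difficulty concrete, the following sketch locates a primary knee on a sampled lifetime curve. It assumes the common definition of the primary knee as the point where a ray from the origin is tangent to the curve, i.e., where lifetime divided by memory is largest; that definition, the function names, and both sample curves are assumptions introduced for illustration and are not data from this study.

```python
def primary_knee(points):
    """Return the (memory, lifetime) sample with the largest lifetime/memory ratio."""
    return max(points, key=lambda p: p[1] / p[0])


def is_monotone(points):
    """True if lifetime never decreases as memory grows (no anomaly in the samples)."""
    ordered = sorted(points)
    return all(later[1] >= earlier[1] for earlier, later in zip(ordered, ordered[1:]))


# Invented samples: (mean memory in pages, mean time between faults).
smooth = [(10, 20.0), (20, 80.0), (30, 150.0), (40, 170.0), (50, 180.0)]
anomalous = [(10, 20.0), (20, 80.0), (30, 60.0), (40, 170.0), (50, 160.0)]

print(primary_knee(smooth), is_monotone(smooth))
print(primary_knee(anomalous), is_monotone(anomalous))
```

On the smooth curve the ratio rises to a single peak, so the knee is well defined. On the anomalous curve the ratio has two nearly equal peaks, and a small change in the measurements could move the selected knee from one to the other.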

The second reason is the difficulty of controlling the policy to produce the maximum possible
throughput. It has been assumed [12] that the maximum throughput is achieved by minimizing the
space time cost. The average ST of a process in the system is given by

$$ ST = \frac{\theta \times T}{N} \qquad (4\text{-}4) $$

where N is the number of jobs in the system and T is the total elapsed time of all the jobs in the
system. The throughput, φ, is given by

$$ \phi = \frac{N}{T} $$

and so

$$ ST = \frac{\theta \times T}{\phi \times T} \qquad (4\text{-}5) $$

Equation (4-5) implies that a maximum throughput can be achieved if each process in the system
operates at its minimal space time point or, equivalently, by minimizing the overall system space
time cost. This argument is not realistic for two reasons. First, the above formula assumes that the
memory space, θ, is completely utilized. This assumption is not always true, especially for large
values of θ.
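
The relation between Equations (4-4) and (4-5) can be checked numerically with a short sketch. The function names and the sample numbers are invented for illustration; the point is only that, when all θ pages are assumed to be occupied for the whole run, the average ST equals θ divided by the throughput.

```python
def average_space_time(theta: float, total_time: float, n_jobs: int) -> float:
    """Equation (4-4): average space time per job when all theta pages stay occupied."""
    return theta * total_time / n_jobs


def throughput(n_jobs: int, total_time: float) -> float:
    """Jobs completed per unit time."""
    return n_jobs / total_time


# Invented numbers: 100 page frames, 5 jobs, 2,000,000 time units in total.
theta, total_time, n_jobs = 100, 2_000_000, 5
st = average_space_time(theta, total_time, n_jobs)
phi = throughput(n_jobs, total_time)

# Equation (4-5): ST equals theta / phi, so for a fixed theta a lower ST means a higher phi.
print(st, theta / phi)
```

If part of the memory is idle, the space actually charged to the jobs is smaller than θ, and the simple inverse relation between ST and φ no longer has to hold.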

The second reason is that minimizing the space time cost of each process does not necessarily
minimize the overall system ST. Each process may have its own optimal τ which differs from those
used by other processes in the system. The assumption that the space time cost has a flat minimal
region, meaning that a wide range of τ can minimize the space time cost, has been shown to be
optimistic for individual programs running in a single programming machine [3], [6], [8].

Our results also show that each program may use a different optimal τ. For example, for
MPL=3 and θ=50, three values of τ (τ = 6, 316, 7000) are needed to minimize the ST of MAIN,
FIELD and INIT, respectively. Tables 4-6 further illustrate this fact. For example, for MPL=5 and
θ=150, five values of τ (51, 501, 601, 3500, 6500) are used by programs MAIN, INIT, CONDUCT,
FIELD, and HWSCRT, respectively. Similarly, for θ=200 and MPL=10, five values of τ are used (51,
451, 551, 601, 1800). Furthermore, a process using τ_sys_opt, the τ that is optimal from the system's
standpoint, may run with a relatively large space time cost compared to its local minimum space
time cost with τ_min. In Table 4-10 we show for each program both space time cost values,
ST(τ_sys_opt) and ST(τ_min). The relative difference between these values is given by

$$ \Delta = \frac{ST(\tau_{sys\_opt}) - ST(\tau_{min})}{ST(\tau_{min})} \times 100\% $$
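
As a worked example of this relative difference, the sketch below recomputes it for three rows of Table 4-10 and reports which program pays the most for running at the system-optimal window. The helper name `detuning_penalty` is an assumption introduced here; the ST values (in units of 10^6) are those listed in the table.

```python
def detuning_penalty(st_at_system_tau: float, st_at_own_tau: float) -> float:
    """Percentage increase in ST when a process runs at the system-optimal window size."""
    return (st_at_system_tau - st_at_own_tau) / st_at_own_tau * 100.0


# program -> (ST at its own tau_min, ST at the system-optimal tau), units of 10^6
rows = {"MAIN": (5.86, 19.5), "FIELD": (3.03, 3.42), "INIT": (9.33, 9.71)}

penalties = {name: detuning_penalty(system_st, own_st)
             for name, (own_st, system_st) in rows.items()}
worst = max(penalties, key=penalties.get)
print({name: round(value) for name, value in penalties.items()}, worst)
```

MAIN, whose own optimal window is far smaller than the system-optimal one, carries by far the largest penalty.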

In Table 4-10 the optimal τ for which the system space time cost is minimized is 601 for MPL=10
and θ=200. For the moment we assume that it is possible to find an optimal τ value which
minimizes the space time cost of each process or the space time cost of all the processes in the
system. The question is whether using this τ achieves a maximum throughput, i.e., is this an
optimal τ? Equation (4-5) [12], [20] gives a positive answer to this question. We have argued that
this is true only if the memory is completely utilized. Our results show that for underutilized
memory the real maximum throughput can deviate from the throughput achieved by using the τ which
minimizes the space


Table 4-10

Relative difference between global ST and local ST for each process

MPL=10; θ=200; τ_opt=601

Program        τ_min     ST_min    ST_opt (10^6)      Δ
MAIN              51       5.86        19.5         233%
FIELD           1800       3.03        3.42          13%
INIT             551       9.33        9.71           4%
HWSCRT           801       33.7        40.1          20%
CONDUCT  [illegible]       15.1        18.9          25%


time cost by almost a factor of 2. In Table 4-11 we show the relation between the maximum
throughput φ_max and the throughput achieved at the minimum space time cost point, φ_STmin. The
relative difference between these two values is given by

$$ \Delta = \frac{\phi_{max} - \phi_{STmin}}{\phi_{STmin}} \times 100\% $$

Table 4-11 shows that for relatively small memory sizes the minimum space time cost and the
maximum throughput are achieved by using the same τ. See in Table 4-11 the entries for θ=6, 10
for MPL=3; θ=10, 20, 30, 40 for MPL=4; θ=100 for MPL=5; and θ=150 for MPL=10. However, for
larger values of θ, φ_max deviates from φ_STmin by a large percentage. For example, Δ=167% for
θ=100 and MPL=3; for θ=200, Δ=136%, 106%, 15% for MPL=4, 5, and 10, respectively. It is
worthwhile to mention, however, that WS has a poor performance compared to CD when the
memory is highly utilized: see the small values of θ in Tables 4-2, 4-3, 4-4.


4.5.  Summary  and  Conclusions 

We have presented in this chapter performance measurements on program behavior in
multiprogramming systems. Program traces are simulated in a multiprogramming system under CD, a
compiler directed memory management policy, and WS, a dynamic policy. We have compared the
performance of CD with that of WS since the latter has been claimed [20] to outperform other
existing policies. Four characteristics of multiprogramming virtual memory systems have been
investigated: page faults, space time cost, system throughput, and controllability.
Table 4-11

Maximum throughput versus throughput at ST_min under WS

Throughput φ (10^-7)


0.26 


2.16 


2.79 


2.8 


2.89 


4.11 


7.05 


7.12 


7.20 


21.1 


28.6 


0.26 


2.16 


2.2 


2.2 


2.2 


2.2 


7.05 


7.05 


7.05 


7.05 


10.7 


   θ     φ_max    φ at ST_min      Δ
  10      0.63       0.63          0%
  20      1.77       1.77          0%
  25      2.23       1.84         22%
  30      2.70       2.68          0%
  40      2.76       2.75          0%
  50      2.88       2.75          5%
 100      4.63       2.75         68%
 150      16.7       10.4         61%
 200      24.5       10.4        136%

2.09 


9.65 


17.8 


22.9 


1.29 


9.65 


11.1 


11.1 


1.69 


2.08 


1.69 


2.17 


7.76 


11.7 


The results reported in this chapter show that CD outperforms WS by a fairly large margin,
especially when the memory is highly utilized. CD is able to dynamically allocate memory space
according to the need of a running program, the available memory space, and the need of other
processes in the system. The outcome of this facility is a relatively low fault rate at a relatively
low memory space cost and, hence, a low space time cost. More importantly, CD is shown to have a
higher throughput than WS.

We have also illustrated that WS lacks controllability, while CD does not exhibit
controllability problems at all. CD is a parameterless policy, while WS has a parameter, τ, which
needs to be tuned in order to achieve a desired goal. It is necessary, for instance, to find the τ that
minimizes the space time cost in order to maximize system throughput [20]. However, it is not
obvious how one would choose the right τ to minimize ST, even using the primary knee criterion
[1], [12]. The primary knee criterion is difficult to apply due to the τ-F anomalies exhibited by WS
(see Chapter 2). In any case, we showed that using an optimal τ which minimizes the space time
cost does not necessarily maximize the throughput. ST_min maximizes the throughput only when the
memory is completely utilized, but then WS has a poor performance compared to CD.
CHAPTER 5


CONCLUSIONS 


5.1.  Summary  of  Results 


A new approach to the management of numerical programs in virtual memory systems is
presented in this study. We have presented a compiler directed policy (CD) which incorporates two
memory directives: 1) ALLOCATE and 2) LOCK and UNLOCK. ALLOCATE estimates the
memory requirements of a process at compile time. Memory requirements are passed to the
operating system at run time through two primitives: the amount of memory requested and the
priority of the request. The CD policy is designed to dynamically adjust a program's memory
allocation according to the status of the available free memory on the system, which dynamically
changes as processes acquire and release memory space. For this purpose, CD incorporates a swapping
mechanism. Subprogram control structures are handled dynamically at run time, thus enabling the
preprocessor at compile time to consider each subroutine as a whole unit.

The performance of CD is evaluated using a trace driven simulator of a multiprogramming
system. Traces of numerical programs are used in the experiments. The performance of CD is
compared to the performance of the WS policy. The results reported in Chapter 4 show that CD is
superior to WS in high memory contention cases. The CD policy produces lower fault rates and lower
space time costs than WS and, therefore, achieves higher throughput. As a result, CD is able to
support higher multiprogramming levels for a given size of physical memory.

We have presented evidence in this study that WS has a controllability drawback. In Chapter
2, we reported empirical results on the WS anomalies. The anomaly types exhibited by WS are
related directly to the WS control parameter, the window size τ. Thus, tuning WS to achieve a
desired performance is not always attainable because of the anomalous behavior. The anomaly
types reported in this thesis are not exhibited by WS when tested in a uniprogramming
environment. The results suggest that conclusions based on experiments with individual single
programs should not be used in a simplistic manner in multiprogramming systems. It has also been
observed that it is not possible to find a single value for the control parameter which can be used
by every process in the system.

On the other hand, CD exhibits no controllability problems and has no control parameter.
Memory requests issued upon executing a directive are processed by the operating system and
granted or rejected according to the available free memory.
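
The grant-or-reject behavior can be pictured with a deliberately simplified sketch. The class `MemoryManager` and its methods are hypothetical names introduced for illustration; the sketch ignores request priorities, LOCK and UNLOCK, and the swapping mechanism, and it is not the operating system interface designed in this thesis.

```python
from dataclasses import dataclass


@dataclass
class MemoryManager:
    """Toy model of the free page pool consulted when a directive is executed."""
    free_pages: int

    def allocate(self, requested_pages: int) -> bool:
        """Grant the request only if enough free pages exist; otherwise reject it."""
        if requested_pages <= self.free_pages:
            self.free_pages -= requested_pages
            return True
        return False

    def release(self, pages: int) -> None:
        """Return pages to the free pool when a process gives them up."""
        self.free_pages += pages


manager = MemoryManager(free_pages=100)
first = manager.allocate(60)    # fits in the free pool
second = manager.allocate(50)   # exceeds the 40 pages that remain
manager.release(60)
third = manager.allocate(50)    # fits again after the release
print(first, second, third, manager.free_pages)
```

No tuning parameter appears anywhere in this decision, which is the sense in which the policy is parameterless.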

In conclusion, this thesis has

(1) presented CD, a compiler directed memory management policy for numerical programs,

(2) shown that WS exhibits anomalies in multiprogramming systems, otherwise unpredicted from
experiments with uniprogramming systems, and

(3) shown that CD outperforms WS by a relatively large margin.

5.2.  Suggestions  for  Future  Research 

The  compiler  directed  policy  presented  in  this  thesis  applies  only  to  numerical  programs.  The 
extension  of  CD  to  other  program  categories  is  essential  before  such  an  approach  to  the  memory 
management  problem  can  be  adopted.  The  locality  characteristics  of  different  application  programs 
have  to  be  understood  thoroughly  before  memory  directives  can  be  designed.  Typical  applications 
are  data  base  systems,  system  programs,  and  AI  application  programs.  The  compiler  directed  policy 
is  designed  for  single  processor  machines.  However,  the  ideas  used  in  this  thesis  can  be  useful  in 
pursuing  similar  techniques  in  multiprocessing  systems. 

The performance of CD compared to WS, although the latter is claimed to be the best
nonlookahead policy [20], is not sufficient to evaluate the performance of CD, which should be
compared to other dynamic policies such as PFF, global LRU, and global CLOCK. Moreover, it is
essential to evaluate the performance of CD by comparing it with the optimal policies. For
instance, CD should be compared with DMIN [10], which generates the absolute minimum space time
cost.

Also, we feel that performance evaluation techniques for virtual memory systems should be
upgraded to include multiprogramming specific characteristics. For instance, one should be able to
measure the influence of one program on the rest of the programs in the system. This is necessary
for scheduling strategies. The techniques developed in this study can also be used to enhance
scheduling strategies, especially in allocating time slots to running processes.

Finally, the main issue which remains to be pursued is the issue of implementation. The
complexity of such a problem lies in the fact that CD has to be incorporated into both the compiler
and the operating system. Furthermore, some architectural features are necessary to implement CD,
particularly at the processing stage of a directive. Therefore, an integrated approach to the design
of computer systems is necessary for CD to be implemented in real systems.